Abstract Recognizing 3D objects in the presence of noise, varying mesh
resolution, occlusion and clutter is a very challenging task. This paper
presents a novel method named Rotational Projection Statistics (RoPS). It has
three major modules: Local Reference Frame (LRF) definition, RoPS feature
description and 3D object recognition. We propose a novel technique to define
the LRF by calculating the scatter matrix of all points lying on the local
surface. RoPS feature descriptors are obtained by rotationally projecting the
neighboring points of a feature point onto 2D planes and calculating a set of
statistics (including low-order central moments and entropy) of the
distribution of these projected points. Using the proposed LRF and RoPS
descriptor, we present a hierarchical 3D object recognition algorithm. The
performance of the proposed LRF, RoPS descriptor and object recognition
algorithm was rigorously tested on a number of popular and publicly available
datasets. Our proposed techniques exhibited superior performance compared to
existing techniques. We also showed that our method is robust with respect to
noise and varying mesh resolution. Our RoPS based algorithm achieved
recognition rates of 100%, 98.9%, 95.4% and 96.0% respectively when tested on
the Bologna, UWA, Queen's and Ca' Foscari Venezia Datasets.

Keywords Surface descriptor • Local feature • Local reference frame •
3D representation • Feature matching • 3D object recognition



1 Introduction 

Object recognition is an active research area in computer vision with numerous
applications including navigation, surveillance, automation, biometrics,
surgery and education (Guo et al., 2013c; Johnson and Hebert, 1999; Lei et al.,
2013; Tombari et al., 2010). The aim of object recognition is to correctly
identify the objects that are present in a scene and recover their poses (i.e.,
position and orientation) (Mian et al., 2006b). Beyond object recognition from
2D images (Brown and Lowe, 2003; Lowe, 2004; Mikolajczyk and Schmid, 2004), 3D
object recognition has been extensively investigated during the last two
decades due to the availability of low cost scanners and high speed computing
devices (Mamic and Bennamoun, 2002). However, recognizing objects from range
images in the presence of noise, varying mesh resolution, occlusion and clutter
is still a challenging task.

Existing algorithms for 3D object recognition can broadly be classified into
two categories, i.e., global feature based and local feature based algorithms
(Bayramoglu and Alatan, 2010; Castellani et al., 2008). The global feature
based algorithms construct a set of features which encode the geometric
properties of the entire 3D object. Examples of these algorithms include the
geometric 3D moments (Paquet et al., 2000), shape distribution (Osada et al.,
2002) and spherical harmonics (Funkhouser et al., 2003). However, these
algorithms require complete 3D models and are therefore sensitive to occlusion
and clutter (Bayramoglu and Alatan, 2010). In contrast, the local feature based
algorithms define a set of features which encode the characteristics of the
local neighborhood of feature points. The local feature based algorithms are
robust to occlusion and clutter. They are therefore even suitable to recognize
partially visible objects in a cluttered scene (Petrelli and Di Stefano, 2011).

A number of local feature based 3D object recognition algorithms have been
proposed in the literature, including point signature based (Chua and Jarvis,
1997), spin image based (Johnson and Hebert, 1999), tensor based (Mian et al.,
2006b) and Exponential Map (EM) based (Bariya et al., 2012) algorithms. Most of
these algorithms follow a paradigm that has three phases, i.e., feature
matching, hypothesis generation and verification, and pose refinement (Taati
and Greenspan, 2011). Among these phases, feature matching plays a critical
role since it directly affects the effectiveness and efficiency of the two
subsequent phases (Taati and Greenspan, 2011).

Descriptiveness and robustness of a feature descriptor are crucial for accurate
feature matching (Bariya and Nishino, 2010). The feature descriptors should be
highly descriptive to ensure an accurate and efficient object recognition. That
is because the accuracy of feature matching directly influences the quality of
the estimated transformation which is used to align the model to the scene, as
well as the computational time required for verification and refinement (Taati
and Greenspan, 2011). Moreover, the feature descriptors should be robust to a
set of nuisances, including noise, varying mesh resolution, clutter, occlusion,
holes and topology changes (Bronstein et al., 2010a; Boyer et al., 2011).

A number of local feature descriptors exist in literature (Section 2.1). These
descriptors can be divided into two broad categories based on whether they use
a Local Reference Frame (LRF) or not. Feature descriptors without any LRF use a
histogram or the statistics of the local geometric information (e.g., normal,
curvature) to form a feature descriptor (Section 2.1.1). Examples of this
category include surface signature (Yamany and Farag, 2002), Local Surface
Patch (LSP) (Chen and Bhanu, 2007) and THRIFT (Flint et al., 2007). In
contrast, feature descriptors with LRF encode the spatial distribution and/or
geometric information of the neighboring points with respect to the defined LRF
(Section 2.1.2). Examples include spin image (Johnson and Hebert, 1999),
Intrinsic Shape Signatures (ISS) (Zhong, 2009) and MeshHOG (Zaharescu et al.,
2012). However, most of the existing feature descriptors still suffer from
either low descriptiveness or weak robustness (Bariya et al., 2012).

In this paper we present a highly descriptive and robust feature descriptor
together with an efficient 3D object recognition algorithm. This paper first
proposes a unique, repeatable and robust LRF for both local feature description
and object recognition (Section 3). The LRF is constructed by performing an
eigenvalue decomposition on the scatter matrix of all the points lying on the
local surface together with a sign disambiguation technique. A novel feature
descriptor, namely Rotational Projection Statistics (RoPS), is then presented
(Section 4). RoPS exhibits both high discriminative power and strong robustness
to noise, varying mesh resolution and a set of deformations. The RoPS feature
descriptor is generated by rotationally projecting the neighboring points onto
three local coordinate planes and calculating several statistics (e.g., central
moment and entropy) of the distribution matrices of the projected points.
Finally, this paper presents a novel hierarchical 3D object recognition
algorithm based on the proposed LRF and RoPS feature descriptor (Section 6).
Comparative experiments on four popular datasets were performed to demonstrate
the superiority of the proposed method (Section 7).

The rest of this paper is organized as follows. Section 2 provides a brief
literature review of local surface feature descriptors and 3D object
recognition algorithms. Section 3 introduces a novel technique for LRF
definition. Section 4 describes our proposed RoPS method for local surface
feature description. Section 5 presents the evaluation results of the RoPS
descriptor on two datasets. Section 6 introduces a RoPS based hierarchical
algorithm for 3D object recognition. Section 7 presents the results and
analysis of our 3D object recognition experiments on four datasets. Section 8
concludes this paper.



2 Related Work

This section presents a brief overview of the main existing methods for local
surface feature description and local feature based 3D object recognition.



2.1 Local Surface Feature Description



2.1.1 Features without LRF

Stein and Medioni (1992) proposed a splash feature by recording the
relationship between the normals of the geodesic neighboring points and the
feature point. This relationship is then encoded into a 3D vector and finally
transformed into curvatures and torsion angles.
Hetzel et al. (2001) constructed a set of features by generating histograms
using depth values, surface normals, shape indices and their combinations.
Results show that the surface normal and shape index exhibit high
discrimination capabilities. Yamany and Farag (2002) introduced a surface
signature by encoding the surface curvature information into a 2D histogram.
This method can be used to estimate scaling transformations as well as to
recognize objects in 3D scenes. Chen and Bhanu (2007) proposed an LSP feature
that encodes the shape indices and normal deviations of the neighboring points.
Flint et al. (2008) introduced a THRIFT feature by calculating a weighted
histogram of the deviation angles between the normals of the neighboring points
and the feature point. Taati et al. (2007) considered the selection of a good
local surface feature for 3D object recognition as an optimization problem and
proposed a set of Variable-Dimensional Local Shape Descriptors (VD-LSD).
However, the process of selecting an optimized subset of VD-LSDs for a specific
object is very time consuming (Taati and Greenspan, 2011). Kokkinos et al.
(2012) proposed a generalization of the 2D shape context feature (Belongie et
al., 2002) to curved surfaces, namely Intrinsic Shape Context (ISC). The ISC is
a meta-descriptor which can be applied to any photometric or geometric field
defined on a surface.

Without LRF, most of these methods generate a feature descriptor by
accumulating certain geometric attributes (e.g., normal, curvature) into a
histogram. Since most of the 3D spatial information is discarded during the
process of histogramming, the descriptiveness of the features without LRF is
limited (Tombari et al., 2010).



2.1.2 Features with LRF

Chua and Jarvis (1997) proposed a point signature by using the distances from
the neighboring points to their corresponding projections on a fitted plane.
One merit of the point signature is that no surface derivative is required. One
of its limitations relates to the fact that the reference direction may not be
unique. It is also sensitive to mesh resolution (Mian et al., 2010). Johnson
and Hebert (1998) used the surface normal as a reference axis and proposed a
spin image representation by spinning a 2D image about the normal of a feature
point and summing up the number of points falling into the bins of that image.
The spin image is one of the most cited methods, but its descriptiveness is
relatively low and it is also sensitive to mesh resolution (Zhong, 2009). Frome
et al. (2004) also used the normal vector as a reference axis and generated a
3D Shape Context (3DSC) by counting the weighted number of points falling in
the neighboring 3D spherical space. However, a reference axis is not a complete
reference frame and there is an uncertainty in the rotation around the normal
(Petrelli and Di Stefano, 2011).

Sun and Abidi (2001) introduced an LRF by using the normal of a feature point
and an arbitrarily chosen neighboring point. Based on the LRF, they proposed a
descriptor named point's fingerprint by projecting the geodesic circles onto
the tangent plane. It was reported that their approach outperforms the 2D
histogram based methods. One major limitation of this method is that their LRF
is not unique (Tombari et al., 2010). Mian et al. (2006b) proposed a tensor
representation by defining an LRF for a pair of oriented points and encoding
the intersected surface area into a multidimensional table. This representation
is robust to noise, occlusion and clutter. However, a pair of points is
required to define an LRF, which causes a combinatorial explosion (Zhong,
2009). Novatnack and Nishino (2008) used the surface normal and a projected
eigenvector on the tangent plane to define an LRF. They proposed an EM
descriptor by encoding the surface normals of the neighboring points into a 2D
domain. The effectiveness of exploiting geometric scale variability in the EM
descriptor has been demonstrated. Zhong (2009) introduced an LRF by calculating
the eigenvectors of the scatter matrix of the neighboring points of a feature
point, and proposed an ISS feature by recording the point distribution in the
spherical angular space. Since the sign of the LRF is not defined
unambiguously, four feature descriptors can be generated from a single feature
point. Mian et al. (2010) proposed a keypoint detection method and used a
similar LRF to Zhong (2009) for their feature description. Tombari et al.
(2010) analyzed the strong impact of the LRF on the performance of feature
descriptors and introduced a unique and unambiguous LRF by performing an
eigenvalue decomposition on the scatter matrix of the neighboring points and
using a sign disambiguation technique. Based on the proposed LRF, they
introduced a feature descriptor called Signature of Histograms of OrienTations
(SHOT). SHOT is very robust to noise, but sensitive to mesh resolution
variation. Petrelli and Di Stefano (2011) proposed a novel LRF which aimed to
estimate a repeatable LRF at the border of a range image. Zaharescu et al.
(2012) proposed a MeshHOG feature by first projecting the gradient vectors onto
three planes defined by an LRF and then calculating a two-level histogram of
these vectors.

However, none of the existing LRF definition techniques is simultaneously
unique, unambiguous, and robust to noise and mesh resolution. Besides, most of
the existing feature descriptors suffer from a number of limitations, including
a low robustness and discriminating power (Bariya et al., 2012).



2.2 3D Object Recognition 

Most of the existing algorithms for local feature based 3D object recognition
follow a three-phase paradigm including feature matching, hypothesis generation
and verification, and pose refinement (Taati and Greenspan, 2011).

Stein and Medioni (1992) used the splash features to represent the objects and
generated hypotheses by using a set of triplets of feature correspondences.
These hypotheses are then grouped into clusters using geometric constraints.
They are finally verified through a least square calculation. Chua and Jarvis
(1997) used point signatures of a scene to match them against those of their
models. The rigid transformation between the scene and a candidate model was
then calculated using three pairs of corresponding points. Its ability to
recognize objects in both single-object and multi-object scenes has been
demonstrated. However, verifying each triplet of feature correspondences is
very time consuming. Johnson and Hebert (1999) generated point correspondences
by matching the spin images of the scene with the spin images of the models.
These point correspondences are first grouped using geometric consistency. The
groups are then used to calculate rigid transformations, which are finally
verified. This algorithm is robust to clutter and occlusion, and capable of
recognizing objects in complicated real scenes. Yamany and Farag (2002) used
surface signatures as feature descriptors and adopted a similar strategy to
Johnson and Hebert (1999) for object recognition. Mian et al. (2006b) obtained
feature correspondences and model hypotheses by matching the tensor
representations of the scene with those of the models. The hypothesis model is
then transformed to the scene and finally verified using the Iterative Closest
Point (ICP) algorithm (Besl and McKay, 1992). Experimental results revealed
that it is superior in terms of recognition rate and efficiency compared to the
spin image based algorithm. Mian et al. (2010) also developed a 3D object
recognition algorithm based on keypoint matching. This algorithm can be used to
recognize objects at different and unknown scales. Taati and Greenspan (2011)
developed a 3D object recognition algorithm based on their proposed VD-LSD
feature descriptors. The optimal VD-LSD descriptor is selected based on the
geometry of the objects and the characteristics of the range sensors. Bariya et
al. (2012) introduced a 3D object recognition algorithm based on the EM feature
descriptor and a constrained interpretation tree.

There are some algorithms in the literature which do not follow the
aforementioned three-phase paradigm. For example, Frome et al. (2004) performed
3D object recognition using the sum of the distances between the scene features
(i.e., 3DSC) and their corresponding model features. This algorithm is
efficient. However, it is not able to segment the recognized object from a
scene, and its effectiveness on real data has not been demonstrated. Shang and
Greenspan (2010) proposed a Potential Well Space Embedding (PWSE) algorithm for
real-time 3D object recognition in sparse range images. It cannot, however,
handle clutter and therefore requires the objects to be segmented a priori from
the scene.

None of the existing object recognition algorithms has explicitly explored the
use of LRF to boost the performance of the recognition. Moreover, most of these
algorithms require three pairs of feature correspondences to establish a
transformation between a model and a scene. This not only increases the run
time due to the combinatorial explosion of the matching pairs, but also
decreases the precision of the estimated transformation (since the chance to
find three correct feature correspondences is much lower compared to finding
only one correct correspondence).



2.3 Paper Contributions

This paper is an extended version of (Guo et al., 2013a,b). It has three major
contributions, which are summarized as follows.

i) We introduce a unique, unambiguous and robust 3D LRF using all the points
lying on the local surface rather than just the mesh vertices. Therefore, our
proposed LRF is more robust to noise and varying mesh resolution. We also use a
novel sign disambiguation technique, so our proposed LRF is unique and
unambiguous. This LRF offers a solid foundation for effective and robust
feature description and object recognition.

ii) We introduce a highly descriptive and robust RoPS feature descriptor. RoPS
is generated by rotationally projecting the neighboring points onto three
coordinate planes and encoding the rich information of the point distribution
into a set of statistics. The proposed RoPS descriptor has been evaluated on
two datasets. Experimental results show that RoPS achieved a high power of
descriptiveness. It is shown to be robust to a number of deformations including
noise, varying mesh resolution, rotation, holes and topology changes (see
Section 5 for details).

iii) We introduce an efficient hierarchical 3D object recognition algorithm
based on the LRF and RoPS feature descriptor. One major advantage of our
algorithm is that a single correct feature correspondence is sufficient for
object recognition. Moreover, by integrating our robust LRF, the proposed
object recognition algorithm can work with any of the existing feature
descriptors (e.g., spin image) in the literature. Rigorous evaluations of the
proposed 3D object recognition algorithm were conducted on four different
popular datasets. Experimental results show that our algorithm achieved high
recognition rates, good efficiency and strong robustness to different
nuisances. It consistently resulted in the best recognition results on the four
datasets.
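
To illustrate why a single correct correspondence can be sufficient once every feature point carries a full LRF, the following is a minimal sketch and not the recognition algorithm itself. It assumes each LRF is stored as a 3x3 matrix whose rows are the x, y and z axes in world coordinates, and that the scene is a rigidly moved copy of the model; the function name `pose_from_single_match` is hypothetical.

```python
import numpy as np

def pose_from_single_match(L_m, p_m, L_s, p_s):
    """Rigid transform (R, t) mapping model to scene from one matched point.

    L_m, L_s: (3, 3) LRFs of the model and scene points, axes stored as rows
              (storage convention assumed for this sketch).
    p_m, p_s: (3,) positions of the matched model and scene feature points.
    """
    # If x_s = R x_m + t, each scene axis is R times the model axis,
    # i.e. L_s = L_m R^T, hence R = L_s^T L_m for orthonormal frames.
    R = L_s.T @ L_m
    t = p_s - R @ p_m
    return R, t
```

With descriptors that lack an LRF, one correspondence only fixes the translation, which is why the three-correspondence strategies discussed in Section 2.2 are needed there.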




Fig. 1: An illustration of a triangle mesh and a point lying on the surface. An
arbitrary point within a triangle can be represented by the triangle's vertices.



3 Local Reference Frame 

A unique, repeatable and robust LRF is important for both effective and
efficient feature description and 3D object recognition. Advantages of such an
LRF are manifold. First, the repeatability of an LRF directly affects the
descriptiveness and robustness of the feature descriptor, i.e., an LRF with a
low repeatability will result in a poor performance of feature matching
(Petrelli and Di Stefano, 2011). Second, compared with the methods which
associate multiple descriptors to a single feature point (e.g., ISS (Zhong,
2009)), a unique LRF can help to improve both the precision and the efficiency
of feature matching (Tombari et al., 2010). Third, a robust 3D LRF helps to
boost the performance of 3D object recognition.

We propose a novel LRF by fully employing the 
point localization information of the local surface. The 
three axes for the LRF are determined by performing 
an eigenvalue decomposition on the scatter matrix of 
all points lying on the local surface. The sign of each 
axis is disambiguated by aligning the direction to the 
majority of the point scatter. 



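3.1 Coordinate Axis Construction

Given a feature point $\mathbf{p}$ and a support radius $r$, the local surface
mesh $S$, which contains $N$ triangles and $M$ vertices, is cropped from the
range image using a sphere of radius $r$ centered at $\mathbf{p}$. For the
$i$th triangle with vertices $\mathbf{p}_{i1}$, $\mathbf{p}_{i2}$ and
$\mathbf{p}_{i3}$, a point lying within the triangle can be represented as:

$$\mathbf{p}_i(s,t) = \mathbf{p}_{i1} + s\,(\mathbf{p}_{i2} - \mathbf{p}_{i1}) + t\,(\mathbf{p}_{i3} - \mathbf{p}_{i1}), \tag{1}$$

where $0 \le s, t \le 1$ and $s + t \le 1$, as illustrated in Fig. 1.

The scatter matrix $\mathbf{C}_i$ of all the points lying within the $i$th
triangle can be calculated as:

$$\mathbf{C}_i = \frac{\int_0^1 \int_0^{1-s} (\mathbf{p}_i(s,t) - \mathbf{p})\,(\mathbf{p}_i(s,t) - \mathbf{p})^T \, dt\, ds}{\int_0^1 \int_0^{1-s} dt\, ds}. \tag{2}$$

Using Eq. 1, the scatter matrix $\mathbf{C}_i$ can be expressed as:

$$\mathbf{C}_i = \frac{1}{12} \sum_{j=1}^{3} \sum_{k=1}^{3} (\mathbf{p}_{ij} - \mathbf{p})\,(\mathbf{p}_{ik} - \mathbf{p})^T + \frac{1}{12} \sum_{j=1}^{3} (\mathbf{p}_{ij} - \mathbf{p})\,(\mathbf{p}_{ij} - \mathbf{p})^T. \tag{3}$$

The overall scatter matrix $\mathbf{C}$ of the local surface $S$ is calculated
as the weighted sum of the scatter matrices of all the triangles, that is:

$$\mathbf{C} = \sum_{i=1}^{N} w_{i1}\, w_{i2}\, \mathbf{C}_i, \tag{4}$$

where $N$ is the number of triangles in the local surface $S$. Here, $w_{i1}$
is the ratio between the area of the $i$th triangle and the total area of the
local surface $S$, that is:

$$w_{i1} = \frac{\left| (\mathbf{p}_{i2} - \mathbf{p}_{i1}) \times (\mathbf{p}_{i3} - \mathbf{p}_{i1}) \right|}{\sum_{i=1}^{N} \left| (\mathbf{p}_{i2} - \mathbf{p}_{i1}) \times (\mathbf{p}_{i3} - \mathbf{p}_{i1}) \right|}, \tag{5}$$

where $\times$ denotes the cross product.

$w_{i2}$ is a weight that is related to the distance from the feature point to
the centroid of the $i$th triangle, that is:

$$w_{i2} = \left( r - \left| \mathbf{p} - \frac{\mathbf{p}_{i1} + \mathbf{p}_{i2} + \mathbf{p}_{i3}}{3} \right| \right)^2. \tag{6}$$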
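
The closed form in Eq. 3 and the weighting in Eqs. 4 to 6 can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: triangles are assumed to be given as an `(N, 3, 3)` array of vertex coordinates, and the function names `triangle_scatter` and `local_scatter` are hypothetical.

```python
import numpy as np

def triangle_scatter(tri, p):
    """Closed-form scatter matrix of one triangle about p (Eq. 3).

    tri: (3, 3) array with one vertex per row; p: (3,) feature point.
    """
    d = tri - p                      # rows are p_ij - p
    s = d.sum(axis=0)                # sum over j of (p_ij - p)
    # The double sum over j, k equals outer(s, s); the single sum equals d^T d.
    return (np.outer(s, s) + d.T @ d) / 12.0

def local_scatter(tris, p, r):
    """Weighted scatter matrix of a local surface (Eqs. 4-6).

    tris: (N, 3, 3) array of triangles; p: (3,) feature point; r: support radius.
    """
    cross = np.cross(tris[:, 1] - tris[:, 0], tris[:, 2] - tris[:, 0])
    norms = np.linalg.norm(cross, axis=1)
    w1 = norms / norms.sum()                                   # area ratio, Eq. 5
    centroids = tris.mean(axis=1)
    w2 = (r - np.linalg.norm(p - centroids, axis=1)) ** 2      # distance weight, Eq. 6
    C = np.zeros((3, 3))
    for tri, a, b in zip(tris, w1, w2):
        C += a * b * triangle_scatter(tri, p)                  # Eq. 4
    return C
```

Because Eq. 3 integrates over the whole triangle analytically, the result does not depend on how densely the surface happens to be sampled by vertices, which is the motivation given in the text for using all surface points.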
Fig. 2: The six models of the Tuning Dataset: (a) Armadillo, (b) Asia Dragon,
(c) Bunny, (d) Dragon, (e) Happy Buddha, (f) Thai Statue.



Note that the first weight $w_{i1}$ is expected to improve the robustness of
the LRF to varying mesh resolutions, since a compensation with respect to the
triangle area is incorporated through this weighting. The second weight
$w_{i2}$ is expected to improve the robustness of the LRF to occlusion and
clutter, since distant points will contribute less to the overall scatter
matrix.

We then perform an eigenvalue decomposition on the overall scatter matrix
$\mathbf{C}$, that is:

$$\mathbf{C}\mathbf{V} = \mathbf{V}\mathbf{E}, \tag{7}$$

where $\mathbf{E}$ is a diagonal matrix of the eigenvalues
$\{\lambda_1, \lambda_2, \lambda_3\}$ of the matrix $\mathbf{C}$, and
$\mathbf{V}$ contains the three orthogonal eigenvectors
$\{\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3\}$ as columns, in the order of
decreasing magnitude of their associated eigenvalues. The three eigenvectors
offer a basis for LRF definition. However, the signs of these vectors are
numerical accidents and are not repeatable between different trials even on the
same surface (Bro et al., 2008; Tombari et al., 2010). We therefore propose a
novel sign disambiguation technique which is described in the next subsection.

It is worth noting that, although some existing techniques also use the idea of
eigenvalue decomposition to construct the LRF (e.g., Mian et al., 2010; Tombari
et al., 2010; Zhong, 2009), they calculate the scatter matrix using just the
mesh vertices. Instead, our technique employs all the points in the local
surface and is therefore more robust compared to existing techniques (as
demonstrated in Section 3.3).
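
The eigenvalue decomposition step can be sketched as below. This is a minimal illustration that assumes the scatter matrix is already available as a symmetric 3x3 NumPy array; `sorted_eigenvectors` is a hypothetical helper name. Note that the signs of the returned eigenvectors are arbitrary, which is exactly the ambiguity the next subsection resolves.

```python
import numpy as np

def sorted_eigenvectors(C):
    """Eigen-decompose a symmetric 3x3 scatter matrix.

    Returns the eigenvalues in decreasing order and the matching
    eigenvectors as the columns v1, v2, v3 of a 3x3 matrix.
    """
    vals, vecs = np.linalg.eigh(C)       # eigh returns eigenvalues in ascending order
    order = np.argsort(vals)[::-1]       # reorder to decreasing eigenvalue
    return vals[order], vecs[:, order]
```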



3.2 Sign Disambiguation 

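In order to eliminate the sign ambiguity of the LRF, each eigenvector should
point in the major direction of the scatter vectors (which start from the
feature point and point in the direction of the points lying on the local
surface). Therefore, the sign of each eigenvector is determined from the sign
of the inner product of the eigenvector and the scatter vectors. Specifically,
the unambiguous vector $\tilde{\mathbf{v}}_1$ is defined as:

$$\tilde{\mathbf{v}}_1 = \mathbf{v}_1 \cdot \mathrm{sign}(h), \tag{8}$$

where $\mathrm{sign}(\cdot)$ denotes the signum function that extracts the sign
of a real number, and $h$ is calculated as:

$$h = \sum_{i=1}^{N} w_{i1}\, w_{i2} \left( \int_0^1 \int_0^{1-s} (\mathbf{p}_i(s,t) - \mathbf{p}) \cdot \mathbf{v}_1 \, dt\, ds \right) = \sum_{i=1}^{N} w_{i1}\, w_{i2} \left( \frac{1}{6} \sum_{j=1}^{3} (\mathbf{p}_{ij} - \mathbf{p}) \cdot \mathbf{v}_1 \right). \tag{9}$$

Similarly, the unambiguous vector $\tilde{\mathbf{v}}_3$ is defined as:

$$\tilde{\mathbf{v}}_3 = \mathbf{v}_3 \cdot \mathrm{sign}\left( \sum_{i=1}^{N} w_{i1}\, w_{i2} \left( \frac{1}{6} \sum_{j=1}^{3} (\mathbf{p}_{ij} - \mathbf{p}) \cdot \mathbf{v}_3 \right) \right). \tag{10}$$

Given two unambiguous vectors $\tilde{\mathbf{v}}_1$ and
$\tilde{\mathbf{v}}_3$, $\tilde{\mathbf{v}}_2$ is defined as
$\tilde{\mathbf{v}}_3 \times \tilde{\mathbf{v}}_1$. Therefore, a unique and
unambiguous 3D LRF for feature point $\mathbf{p}$ is finally defined. Here,
$\mathbf{p}$ is the origin, and $\tilde{\mathbf{v}}_1$, $\tilde{\mathbf{v}}_2$
and $\tilde{\mathbf{v}}_3$ are the x, y and z axes respectively. With this LRF,
a unique, pose invariant and highly discriminative local feature descriptor can
now be generated.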
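
The sign disambiguation of Eqs. 8 to 10 can be sketched as follows. The sketch assumes the combined per-triangle weights (the product of the area and distance weights) are passed in as one array `w`, and it treats the degenerate case $h = 0$ by keeping the vector unchanged, which is an assumption of this example since the signum function would return zero there. The names `disambiguate` and `build_lrf` are hypothetical.

```python
import numpy as np

def disambiguate(v, tris, p, w):
    """Flip v so that it points toward the weighted majority of scatter vectors.

    v: (3,) eigenvector; tris: (N, 3, 3) triangles; p: (3,) feature point;
    w: (N,) combined weights w_i1 * w_i2.
    """
    # (1/6) * sum_j (p_ij - p) . v for every triangle, as in Eqs. 9 and 10
    per_tri = ((tris - p) @ v).sum(axis=1) / 6.0
    h = float(np.dot(w, per_tri))
    return v if h >= 0 else -v       # assumption: h == 0 leaves v unchanged

def build_lrf(v1, v3, tris, p, w):
    """Return the LRF as a 3x3 matrix whose rows are the x, y and z axes."""
    x = disambiguate(v1, tris, p, w)
    z = disambiguate(v3, tris, p, w)
    y = np.cross(z, x)               # y axis defined as z cross x
    return np.stack([x, y, z])
```

Defining the y axis by a cross product rather than by a third sign test guarantees a right-handed, orthonormal frame whenever the x and z axes are orthonormal.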

3.3 Performance of the Proposed LRF 

To evaluate the repeatability and robustness of our proposed LRF, we calculated
the LRF errors between the corresponding points in the scenes and models. The
six models (i.e., "Armadillo", "Asia Dragon", "Bunny", "Dragon", "Happy Buddha"
and "Thai Statue") used in this experiment were taken from the Stanford 3D
Scanning Repository (Curless and Levoy, 1996). They are shown in Fig. 2. The
six scenes were created by resampling the models down to 1/2 of their original
mesh resolution and then adding Gaussian noise with a standard deviation of 0.1
mesh resolution (mr) to the data. We refer to this dataset as the "Tuning
Dataset" in the rest of this paper.
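
The noise step of this scene construction can be sketched as below. The text does not define mesh resolution explicitly in this section, so the sketch assumes the common convention that it is the mean edge length of the mesh; the resampling step is not shown. The names `mesh_resolution` and `add_gaussian_noise`, and the `seed` parameter, are hypothetical.

```python
import numpy as np

def mesh_resolution(vertices, faces):
    """Mean edge length of a triangle mesh (assumed definition of 'mr')."""
    edges = np.concatenate([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]])
    edges = np.unique(np.sort(edges, axis=1), axis=0)    # count shared edges once
    lengths = np.linalg.norm(vertices[edges[:, 0]] - vertices[edges[:, 1]], axis=1)
    return float(lengths.mean())

def add_gaussian_noise(vertices, faces, sigma_mr=0.1, seed=0):
    """Add zero-mean Gaussian noise with standard deviation sigma_mr * mr."""
    rng = np.random.default_rng(seed)
    sigma = sigma_mr * mesh_resolution(vertices, faces)
    return vertices + rng.normal(0.0, sigma, size=vertices.shape)
```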

We randomly selected 1000 points in each model and we refer to these points as
feature points. We then obtained the corresponding points in the scene by
searching the points with the smallest distances to the feature points in the
model. For each point pair $(\mathbf{p}_{Si}, \mathbf{p}_{Mi})$, we calculated
the LRFs for both points, denoted as $\mathbf{L}_{Si}$ and $\mathbf{L}_{Mi}$,
respectively. Using a criterion similar to that in (Mian et al., 2006a), the
error between the two LRFs of the $i$th point pair can be calculated by:

$$\epsilon_i = \arccos\left( \frac{\mathrm{trace}\left( \mathbf{L}_{Si}\, \mathbf{L}_{Mi}^{-1} \right) - 1}{2} \right) \frac{180}{\pi}, \tag{11}$$

where $\epsilon_i$ represents the amount of rotation error between two LRFs and
is zero in the case of no error.
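
Eq. 11 can be evaluated directly with NumPy, as in the sketch below. It assumes each LRF is a 3x3 orthonormal matrix; the clipping of the cosine to the interval from -1 to 1 is an addition of this example to guard against round-off, and `lrf_error_deg` is a hypothetical name.

```python
import numpy as np

def lrf_error_deg(L_s, L_m):
    """Rotation error in degrees between two LRFs (Eq. 11)."""
    c = (np.trace(L_s @ np.linalg.inv(L_m)) - 1.0) / 2.0
    # Clip to the valid arccos domain; round-off can push c slightly outside it.
    return float(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))
```

For example, two identical frames give an error of 0 degrees, and frames that differ by a quarter turn about one axis give 90 degrees.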

Our proposed LRF technique was tested on the Tuning Dataset with comparison to
several existing techniques, e.g., those proposed by Novatnack and Nishino
(2008), Mian et al. (2010), Tombari et al. (2010), and Petrelli and Di Stefano
(2011). We tested each LRF technique five times by randomly selecting 1000
different point pairs each time. The overall LRF errors of each technique are
shown in Fig. 3 as a histogram. Ideally, all of the LRF errors should lie
around the zero value (in the first bin of the histogram). It is clear that our
proposed technique performed best, with 83.5% of the point pairs having LRF
errors less than 10 degrees, whereas the second best one (i.e., proposed by
Petrelli and Di Stefano (2011)) secured only 43.2% of the point pairs with LRF
errors less than 10 degrees. Other techniques only had around 40% of the point
pairs with LRF errors less than 10 degrees. These results clearly indicate that
our proposed LRF is more repeatable and more robust than the state-of-the-art
in the presence of noise and mesh resolution variation.

In order to further assess the influence of a weighting strategy, we used a
distance weight
$w_{i3} = r - \left| \mathbf{p} - \frac{\mathbf{p}_{i1} + \mathbf{p}_{i2} + \mathbf{p}_{i3}}{3} \right|$
(following the approach of (Tombari et al., 2010)) to replace the weights
$w_{i1}$ and $w_{i2}$ in Equations 4, 9 and 10, resulting in a modified LRF.
The histogram of LRF errors of the modified technique is shown in Fig. 3. The
performance of the modified LRF decreased significantly compared to the
original proposed LRF. This observation reveals that the weighting strategy
using both the quadratic distance weight $w_{i2}$ and the area weight $w_{i1}$
produced more robust results compared to those using only a linear distance
weight $w_{i3}$.

Fig. 3 shows that part of the LRF errors of each technique are larger than 80
degrees. This is mainly due to the presence of local symmetrical surfaces
(e.g., flat or spherical surfaces) in the scenes. For a local symmetrical
surface, there is an inherent sign ambiguity of its LRF because the
distribution of points is almost the same in all directions. In order to deal
with this case, we adopt a feature point selection technique which uses the
ratio of eigenvalues to avoid local symmetrical surfaces (see Section [S]).

Fig. 3: Histogram of the LRF errors for the six scenes and models of the Tuning
Dataset. Our proposed technique outperformed the existing techniques by a large
margin. (Figure best seen in color.)

Once an LRF is determined, the next step is to define a local surface
descriptor. In the next section, we propose a novel RoPS descriptor.



4 Local Surface Description

A local surface descriptor needs to be invariant to rotation and robust to
noise, varying mesh resolution, occlusion, clutter and other nuisances. In this
section, we propose a novel local surface feature descriptor, namely RoPS, by
performing local surface rotation, neighboring points projection and statistics
calculation.



4.1 RoPS Feature Descriptor 

An illustrative example of the overall RoPS method is given in Fig. 4. From a
range image/model, a local surface is selected for a feature point $\mathbf{p}$
given a support radius $r$. Figures 4(a) and (b) respectively show a model and
a local surface. We have already defined the LRF for $\mathbf{p}$, and the
vertices of the triangles in the local surface $S$ constitute a pointcloud
$Q = \{\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M\}$. The pointcloud $Q$
is then transformed with respect to the LRF in order to achieve rotation
invariance, resulting in a transformed pointcloud
$Q' = \{\mathbf{q}'_1, \mathbf{q}'_2, \ldots, \mathbf{q}'_M\}$. We then follow
a number of steps which are described as follows.

First, the pointcloud is rotated around the x axis by an angle $\theta_k$,
resulting in a rotated pointcloud $Q'(\theta_k)$, as shown in Fig. 4(c). This
pointcloud $Q'(\theta_k)$ is then projected onto three coordinate planes (i.e.,
the xy, xz and yz planes) to obtain three projected pointclouds
$Q'_i(\theta_k)$, $i = 1, 2, 3$. Note that the projection offers a means to
describe the 3D local surface in a concise and efficient manner. That is
because 2D projections clearly preserve a certain amount of unique 3D geometric
information of the local surface from that particular viewpoint.

Fig. 4: An illustration of the generation of a RoPS feature descriptor for one
rotation. (a) The Armadillo model and the local surface around a feature point.
(b) The local surface is cropped and transformed in the LRF. (c) The local
surface is rotated around a coordinate axis. (d) The neighboring points are
projected onto three 2D planes. (e) A distribution matrix is obtained for each
plane by partitioning the 2D plane into bins and counting up the number of
points falling into each bin. The dark color indicates a large number. (f) Each
distribution matrix is then encoded into several statistics. (g) The statistics
from three distribution matrices are concatenated to form a sub-feature
descriptor for one rotation. (Figure best seen in color.)

Next, for each projected pointcloud $Q'_i(\theta_k)$, a 2D bounding rectangle is obtained, which is subsequently divided into $L \times L$ bins, as shown in Fig. 4(d). The number of points falling into each bin is then counted to yield an $L \times L$ matrix $D$, as shown in Fig. 4(e). We refer to the matrix $D$ as a "distribution matrix" since it represents the 2D distribution of the neighboring points. The distribution matrix $D$ is further normalized such that the sum of all bins is equal to one in order to achieve invariance to variations in mesh resolution.
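The rotation, projection and binning steps just described can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy point coordinates, the rotation angle and the handling of points on the upper boundary of the rectangle are all assumptions made for illustration.

```python
import math

def rotate_x(points, theta):
    """Rotate 3D points (x, y, z) about the x axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(x, c * y - s * z, s * y + c * z) for x, y, z in points]

def distribution_matrix(points2d, L):
    """Bin 2D points into an L x L grid over their bounding rectangle,
    then normalize so that all bins sum to one."""
    us = [u for u, _ in points2d]
    vs = [v for _, v in points2d]
    u_min, v_min = min(us), min(vs)
    # Assumption: a zero-width rectangle is given unit width to avoid division by zero.
    u_span = (max(us) - u_min) or 1.0
    v_span = (max(vs) - v_min) or 1.0
    D = [[0.0] * L for _ in range(L)]
    for u, v in points2d:
        # Assumption: points on the upper boundary are clamped into the last bin.
        i = min(int((u - u_min) / u_span * L), L - 1)
        j = min(int((v - v_min) / v_span * L), L - 1)
        D[i][j] += 1.0
    n = float(len(points2d))
    return [[count / n for count in row] for row in D]

# Toy local surface, assumed to be already expressed in the LRF (made-up values).
Q_prime = [(0.0, 0.0, 0.0), (1.0, 0.2, 0.1), (0.5, 1.0, 0.3),
           (0.2, 0.7, 0.9), (0.9, 0.9, 0.5), (0.4, 0.1, 0.8)]
rotated = rotate_x(Q_prime, math.pi / 6)
planes = {
    "xy": [(x, y) for x, y, z in rotated],
    "xz": [(x, z) for x, y, z in rotated],
    "yz": [(y, z) for x, y, z in rotated],
}
matrices = {name: distribution_matrix(pts, L=5) for name, pts in planes.items()}
```

Because each matrix is divided by the number of points, doubling the sampling density of the same surface leaves the bin values roughly unchanged, which is the resolution invariance argued for above.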

The information in the distribution matrix $D$ is further condensed in order to achieve computational and storage efficiency. In this paper, a set of statistics is extracted from the distribution matrix $D$, including central moments (Demi et al., 2000; Hu, 1962) and Shannon entropy (Shannon, 1948). The central moments are utilized for their mathematical simplicity and rich descriptiveness (Hu, 1962), while Shannon entropy is selected for its strong power to measure the information contained in a probability distribution (Shannon, 1948).
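The central moment $\mu_{mn}$ of order $m + n$ of matrix $D$ is defined as:

$$\mu_{mn} = \sum_{i=1}^{L} \sum_{j=1}^{L} (i - \bar{i})^m (j - \bar{j})^n D(i, j), \qquad (12)$$

where

$$\bar{i} = \sum_{i=1}^{L} \sum_{j=1}^{L} i \, D(i, j), \qquad (13)$$

and

$$\bar{j} = \sum_{i=1}^{L} \sum_{j=1}^{L} j \, D(i, j). \qquad (14)$$

The Shannon entropy $e$ is calculated as:

$$e = -\sum_{i=1}^{L} \sum_{j=1}^{L} D(i, j) \log(D(i, j)). \qquad (15)$$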
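As a worked illustration of the central moments and the Shannon entropy of a distribution matrix, the following sketch evaluates them on a small uniform matrix. The function names are hypothetical, the natural logarithm is an assumption (the base is not stated here), and empty bins are skipped under the usual convention that 0 log 0 = 0.

```python
import math

def central_moment(D, m, n):
    """Central moment mu_mn of a normalized L x L distribution matrix D."""
    L = len(D)
    # Row and column indices run from 1 to L, as in the definitions above.
    i_bar = sum((i + 1) * D[i][j] for i in range(L) for j in range(L))
    j_bar = sum((j + 1) * D[i][j] for i in range(L) for j in range(L))
    return sum(((i + 1) - i_bar) ** m * ((j + 1) - j_bar) ** n * D[i][j]
               for i in range(L) for j in range(L))

def shannon_entropy(D):
    """Shannon entropy of D; empty bins contribute nothing (assumed convention)."""
    return -sum(d * math.log(d) for row in D for d in row if d > 0.0)

# A uniform 2 x 2 distribution matrix: every bin holds a quarter of the points.
D = [[0.25, 0.25],
     [0.25, 0.25]]

mu_00 = central_moment(D, 0, 0)   # total mass of a normalized matrix
mu_10 = central_moment(D, 1, 0)   # first-order moments vanish by construction
mu_01 = central_moment(D, 0, 1)
mu_11 = central_moment(D, 1, 1)
mu_22 = central_moment(D, 2, 2)
e = shannon_entropy(D)            # equals log(4) for four equally filled bins
```

For this matrix the zeroth-order moment is one and the first-order moments are zero, whatever the point distribution, so only second-order and higher moments carry information about the surface.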

Theoretically, a complete set of central moments can be used to uniquely describe the information contained in a matrix (Hu, 1962). However, in practice, only a small subset of the central moments can sufficiently represent the distribution matrix $D$. These selected central moments together with the Shannon entropy are then used to form a statistics vector, as shown in Fig. 4(f). The three statistics vectors from the xy, xz and yz planes are then concatenated to form a sub-feature $f_x(\theta_k)$. Note that $f_x(\theta_k)$ denotes the total statistics for the $k$th rotation around the x axis, as shown in Fig. 4(g).

In order to encode the "complete" information of the local surface, the pointcloud $Q'$ is rotated around the x axis by a set of angles $\{\theta_k\}$, $k = 1, 2, \ldots, T$, resulting in a set of sub-features $\{f_x(\theta_k)\}$, $k = 1, 2, \ldots, T$. Further, $Q'$ is rotated by a set of angles around the y axis and a set of sub-features $\{f_y(\theta_k)\}$, $k = 1, 2, \ldots, T$ is calculated. Finally, $Q'$ is rotated by a set of angles around the z axis and a set of sub-features $\{f_z(\theta_k)\}$, $k = 1, 2, \ldots, T$ is calculated. The overall feature descriptor is then generated by concatenating the sub-features of all the rotations into a vector, that is:

$$f = \{f_x(\theta_k), f_y(\theta_k), f_z(\theta_k)\}, \quad k = 1, 2, \ldots, T. \qquad (16)$$
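The assembly of the overall descriptor can be sketched end to end as follows. This is a simplified illustration rather than the authors' code: the helper names, the random toy points and, in particular, the choice of evenly spaced rotation angles are assumptions, since the angle values are not specified in this section. The statistics used are the four moments and the entropy that the paper eventually selects.

```python
import math
import random

def rotate(points, axis, theta):
    """Rotate 3D points about coordinate axis 0 (x), 1 (y) or 2 (z)."""
    c, s = math.cos(theta), math.sin(theta)
    a, b = (axis + 1) % 3, (axis + 2) % 3   # the two coordinates that change
    rotated = []
    for p in points:
        q = list(p)
        q[a] = c * p[a] - s * p[b]
        q[b] = s * p[a] + c * p[b]
        rotated.append(tuple(q))
    return rotated

def distribution_matrix(points2d, L):
    """Normalized L x L histogram over the 2D bounding rectangle."""
    us = [u for u, _ in points2d]
    vs = [v for _, v in points2d]
    u_min, v_min = min(us), min(vs)
    u_span = (max(us) - u_min) or 1.0
    v_span = (max(vs) - v_min) or 1.0
    D = [[0.0] * L for _ in range(L)]
    for u, v in points2d:
        i = min(int((u - u_min) / u_span * L), L - 1)
        j = min(int((v - v_min) / v_span * L), L - 1)
        D[i][j] += 1.0 / len(points2d)
    return D

def statistics(D):
    """Return [mu_11, mu_21, mu_12, mu_22, entropy] for a distribution matrix."""
    L = len(D)
    cells = [(i + 1, j + 1, D[i][j]) for i in range(L) for j in range(L)]
    i_bar = sum(i * d for i, _, d in cells)
    j_bar = sum(j * d for _, j, d in cells)

    def mu(m, n):
        return sum((i - i_bar) ** m * (j - j_bar) ** n * d for i, j, d in cells)

    entropy = -sum(d * math.log(d) for _, _, d in cells if d > 0.0)
    return [mu(1, 1), mu(2, 1), mu(1, 2), mu(2, 2), entropy]

def rops_descriptor(points, L=5, T=3):
    """Concatenate the statistics over 3 axes x T rotations x 3 planes."""
    # Assumption: T evenly spaced angles in (0, pi/2]; purely illustrative.
    angles = [(k + 1) * (math.pi / 2) / T for k in range(T)]
    feature = []
    for axis in range(3):                              # x, y, z
        for theta in angles:
            rotated = rotate(points, axis, theta)
            for a, b in ((0, 1), (0, 2), (1, 2)):      # xy, xz, yz planes
                D = distribution_matrix([(p[a], p[b]) for p in rotated], L)
                feature.extend(statistics(D))
    return feature

rng = random.Random(0)
toy_surface = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
               for _ in range(50)]
descriptor = rops_descriptor(toy_surface)   # 3 * 3 * 3 * 5 values
```

With T = 3 rotations per axis, three projection planes and five statistics per plane, the vector has 3 x 3 x 3 x 5 = 135 entries, independent of the number of bins L.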



It is expected that the RoPS descriptor would be highly discriminative (as demonstrated in Section 5) since it encodes the geometric information of a local surface from a set of viewpoints. Note that some existing view-based methods can be found in the literature, such as (Yamauchi et al., 2006), (Ohbuchi et al., 2008) and (Atmosukarto and Shapiro, 2010). However, these methods are based on global features and originate from the 3D shape retrieval area. They are therefore not suitable for 3D object recognition due to their sensitivity to occlusion and clutter.

Other related methods include the spin image (Johnson and Hebert, 1999) and snapshot (Malassiotis and Strintzis, 2007) descriptors. A spin image is generated by projecting a local surface onto a 2D plane using a cylindrical parametrization. Similarly, a snapshot is obtained by rendering a local surface from the viewpoint which is perpendicular to the surface. Our RoPS differs from these methods in several aspects. First, RoPS represents a local surface from a set of viewpoints rather than just one view (as in the case of spin image and snapshot). Second, RoPS is associated with a unique and unambiguous LRF, and it is invariant to rotation. In contrast, spin image discards cylindrical angular information and snapshot is sensitive to rotation. Third, RoPS is more compact than spin image and snapshot since RoPS further encodes 2D matrices with a set of statistics. The typical lengths of RoPS, spin image and snapshot are 135, 225 and 1600, respectively (see Table 2, (Johnson and Hebert, 1999) and (Malassiotis and Strintzis, 2007)).



4.2 RoPS Generation Parameters

The RoPS feature descriptor has four parameters: i) the combination of statistics, ii) the number of partition bins $L$, iii) the number of rotations $T$ around each coordinate axis, and iv) the support radius $r$. The performance of the RoPS descriptor under different settings of these parameters was tested on the Tuning Dataset using the criterion of Recall vs 1-Precision Curve (RP Curve).

The RP Curve is one of the most popular criteria used for the assessment of a feature descriptor (Flint et al., 2008; Hou and Qin, 2010; Ke and Sukthankar, 2004; Mikolajczyk and Schmid, 2005). It is calculated as follows: given a scene, a model and the ground truth transformation, a scene feature is matched against all model features to find the closest feature. If the ratio between the smallest distance and the second smallest one is less than a threshold, then the scene feature and the closest model feature are considered a match. Further, a match is considered a true positive only if the distance between the physical locations of the two features is sufficiently small; otherwise it is considered a false positive. Therefore, recall is defined as:

$$\text{recall} = \frac{\text{the number of true positives}}{\text{total number of positives}}. \qquad (17)$$

1-precision is defined as:

$$\text{1-precision} = \frac{\text{the number of false positives}}{\text{total number of matches}}. \qquad (18)$$

By varying the threshold, a RP Curve can be generated. Ideally, a RP Curve would fall in the top left corner of the plot, which means that the feature obtains both high recall and precision.

4.2.1 The Combination of Statistics

The selection of the subset of statistics plays an important role in the generation of a RoPS feature descriptor. It determines not only the capability for encapsulating the information in a distribution matrix but also the size of a feature vector. We considered eight combinations of statistics (a number of low-order moments and entropy), as listed in Table 1, and tested the performance of each combination in terms of the RP Curve. The other three parameters were set constant as $L = 5$, $T = 3$ and $r = 15$mr. It is worth noting that the zeroth-order central moment $\mu_{00}$ and the first-order central moments $\mu_{01}$ and $\mu_{10}$ were excluded from the combinations of the statistics, because these moments are constant (i.e., $\mu_{00} = 1$, $\mu_{01} = 0$ and $\mu_{10} = 0$) and therefore contain no information about the local surface. Our experimental results are shown in Fig. 5(a).
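Returning briefly to the RP Curve criterion of Section 4.2 (Eqs. 17 and 18), one point of the curve can be computed as in the sketch below. The data layout, the function name, the location tolerance and the toy features are assumptions for illustration; scene positions are assumed to have been mapped into the model frame with the ground truth transformation beforehand.

```python
import math

def rp_point(scene, model, ratio_threshold, location_tolerance, total_positives):
    """Return (recall, 1-precision) for one value of the ratio threshold.

    scene and model are lists of (descriptor, position) pairs.
    """
    true_pos = false_pos = 0
    for s_desc, s_pos in scene:
        # Brute-force search for the closest and second closest model features.
        ranked = sorted(model, key=lambda m: math.dist(s_desc, m[0]))
        (best_desc, best_pos), (second_desc, _) = ranked[0], ranked[1]
        d1 = math.dist(s_desc, best_desc)
        d2 = math.dist(s_desc, second_desc)
        if d2 > 0.0 and d1 / d2 < ratio_threshold:
            # A match counts as a true positive only if the two features lie
            # at (nearly) the same physical location.
            if math.dist(s_pos, best_pos) <= location_tolerance:
                true_pos += 1
            else:
                false_pos += 1
    matches = true_pos + false_pos
    recall = true_pos / total_positives
    one_minus_precision = false_pos / matches if matches else 0.0
    return recall, one_minus_precision

# Toy data with 2D descriptors and 3D positions (made-up values).
model = [((0.0, 0.0), (0.0, 0.0, 0.0)),
         ((1.0, 0.0), (1.0, 0.0, 0.0)),
         ((0.0, 1.0), (0.0, 1.0, 0.0))]
scene = [((0.05, 0.0), (0.0, 0.0, 0.0)),   # distinctive and correctly located
         ((0.9, 0.0), (0.0, 1.0, 0.0)),    # distinctive but wrongly located
         ((0.5, 0.5), (0.0, 1.0, 0.0))]    # ambiguous: equidistant to all models

thresholds = [0.05, 0.08, 0.5]
curve = [rp_point(scene, model, t, 0.1, total_positives=3) for t in thresholds]
```

Sweeping the threshold from strict to loose admits more matches, so recall can only grow while 1-precision typically grows with it; plotting the resulting pairs gives the RP Curve.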



Fig. 5: Effect of the RoPS generation parameters. (a) Different combinations of statistics. (b) The number of partition bins L. There is a twin plot in (b), where the right plot is a magnified version of the region indicated by the rectangle in the left plot. (c) The number of rotations T. There is a twin plot in (c), where the right plot is a magnified version of the region indicated by the rectangle in the left plot. (d) The support radius r. (We chose the No. 6 combination of the statistics and set L = 5, T = 3 and r = 15mr in this paper as a tradeoff between effectiveness and efficiency. Figure best seen in color.)



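Table 1: Different combinations of the statistics.

No.   Combination of the statistics
1     $\mu_{02}, \mu_{11}, \mu_{20}$
2     $\mu_{02}, \mu_{11}, \mu_{20}, \mu_{03}, \mu_{12}, \mu_{21}, \mu_{30}$
3     $\mu_{02}, \mu_{11}, \mu_{20}, \mu_{03}, \mu_{12}, \mu_{21}, \mu_{30}, \mu_{04}, \mu_{13}, \mu_{22}, \mu_{31}, \mu_{40}$
4     $\mu_{02}, \mu_{11}, \mu_{20}, \mu_{03}, \mu_{12}, \mu_{21}, \mu_{30}, \mu_{04}, \mu_{13}, \mu_{22}, \mu_{31}, \mu_{40}, e$
5     $\mu_{11}, \mu_{21}, \mu_{12}, \mu_{22}$
6     $\mu_{11}, \mu_{21}, \mu_{12}, \mu_{22}, e$
7     $\mu_{11}, \mu_{21}, \mu_{12}, \mu_{22}, \mu_{31}, \mu_{13}$
8     $\mu_{11}, \mu_{21}, \mu_{12}, \mu_{22}, \mu_{31}, \mu_{13}, e$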



It is clear that the No. 6 combination achieved the best performance, followed by the No. 5 combination. The No. 3, No. 4 and No. 8 combinations obtained comparable performance, with recall being a little lower than that of the No. 6 combination. The superior performance of the No. 6 combination is due to two facts. First, the low-order moments $\mu_{11}$, $\mu_{21}$, $\mu_{12}$, $\mu_{22}$ and entropy $e$ contain the most meaningful and significant information of the distribution matrix. Consequently, the descriptiveness of these statistics is sufficiently high. Second, the low-order moments are more robust to noise and varying mesh resolution compared to the high-order moments. Beyond the high precision and recall, the size of the No. 6 combination is also small, which means that the calculation and matching of feature descriptors can be performed efficiently. Therefore, the No. 6 combination, i.e., $\{\mu_{11}, \mu_{21}, \mu_{12}, \mu_{22}, e\}$, was selected to represent the information in a distribution matrix and to form the RoPS descriptor.



4.2.2 The Number of Partition Bins

The number of partition bins $L$ is another important parameter in the RoPS generation. It determines both the descriptiveness and robustness of a descriptor. That is, a dense partition of the projected points offers more details about the point distribution, but it also increases the sensitivity to noise and varying mesh resolution. We tested the performance of the RoPS descriptor on the Tuning Dataset with respect to the number of partition bins, while the two other parameters were set to $T = 3$ and $r = 15$mr. The experimental results are shown in Fig. 5(b) as a twin plot, where the right plot is a magnified version of the region indicated by the rectangle in the left plot.

The plot shows that the performance of the RoPS descriptor improved as the number of partition bins increased from 3 to 5. This is because more details about the point distribution were encoded into the feature descriptor. However, for a number of partition bins larger than 5, the performance degraded as the number of partition bins increased. This is because a dense partition makes the distribution matrix more susceptible to variations in the spatial position of the neighboring points. It can therefore be inferred that 5 is the most suitable number of partitions as a tradeoff between the descriptiveness and the robustness to noise and varying mesh resolution. We therefore used $L = 5$ in this paper.

4.2.3 The Number of Rotations

The number of rotations $T$ determines the "completeness" when describing the local surface using a RoPS feature descriptor. That is, increasing the number of rotations means that more information of the local surface is encoded into the overall feature descriptor. We tested the performance of the RoPS feature descriptor with respect to a varying number of rotations while keeping the other parameters constant (i.e., $r = 15$mr). The results are given in Fig. 5(c) as a twin plot, where the right plot is a magnified version of the region indicated by the rectangle in the left plot.

It was found that as the number of rotations increased, the descriptiveness of the RoPS increased, resulting in an improvement of the matching performance (which confirmed our assumption). Specifically, the performance of the RoPS descriptor improved significantly as the number of rotations increased from 1 to 2, as shown in the left plot of Fig. 5(c). The performance then improved slightly as the number of rotations increased from 2 to 6, as indicated in the magnified version shown in the right plot of Fig. 5(c). In fact, there was no notable difference between the performance with 3 rotations and with 6 rotations. That is because almost all the information of the local surface is encoded in the feature descriptor by rotating the neighboring points 3 times around each axis. Therefore, increasing the number of rotations any further will not necessarily add any significant information to the feature descriptor. Moreover, increasing the number of rotations will cost more computational and memory resources. We therefore set the number of rotations to 3 in this paper.

4.2.4 The Support Radius

The support radius $r$ determines the amount of surface that is encoded by the RoPS feature descriptor. The value of $r$ can be chosen depending on how local the feature should be, and a tradeoff lies between the feature's descriptiveness and robustness to occlusion. That is, a large support radius enables the RoPS descriptor to encapsulate more information of the object and therefore provides more descriptiveness. On the other hand, a large support radius increases the sensitivity to occlusion and clutter. We tested the performance of the RoPS feature descriptor with respect to varying support radius while keeping the other parameters fixed. The results are given in Fig. 5(d).

The results show that the recall and precision performance of the RoPS feature descriptor improved steadily as the support radius increased from 5mr (mr = mesh resolution) to 25mr. Specifically, there was a significant improvement of the matching performance as the support radius increased from 5mr to 10mr. This is because a radius of 5mr is too small to contain sufficient discriminating information of the underlying surface. The RoPS feature descriptor achieved good results with a support radius of 15mr, achieving a high precision of about 0.9 and a high recall of about 0.9. Although the performance of the RoPS feature descriptor further improved slightly as the support radius was increased to 25mr, the performance deteriorated sharply when the support radius was set to 30mr. We chose to set the support radius to 15mr in this paper to maintain a strong robustness to occlusion and clutter. An illustration is shown in Fig. 6. The range image contains two objects in the presence of occlusion and clutter, and a feature point is selected near the tail of the chicken. The red, green and blue spheres respectively represent the support regions with radii of 25mr, 15mr and 5mr for the feature point. As the radius increases from 5mr to 25mr, points on the surface within the support region are more likely to be missing due to occlusion, and points from other objects (e.g., T-rex on the right) are more likely to be included in the support region due to clutter. Therefore, the resulting feature descriptor is more likely to be affected by occlusion and clutter.

Note that several adaptive-scale keypoint detection methods have been proposed for the purpose of determining the support radius based on the inherent scale of a feature point (Tombari et al., 2013). However, we simply adopt a fixed support radius since our focus is on feature description and object recognition rather than keypoint detection. Moreover, our proposed RoPS descriptor has been demonstrated to achieve an even better performance compared to the methods with adaptive-scale keypoint detection (e.g., EM matching and keypoint matching), as analyzed in Section 7.

Fig. 6: An illustration of the descriptor's robustness to occlusion and clutter with respect to varying support radius. The red, green and blue spheres respectively represent the support regions with radii of 25mr, 15mr and 5mr for a feature point. (Figure best seen in color.)



5 Performance of the RoPS Descriptor

The descriptiveness and robustness of our proposed RoPS feature descriptor were first evaluated on the Bologna Dataset (Tombari et al., 2010) with respect to different levels of noise, varying mesh resolution and their combinations. It was also evaluated on the PHOTOMESH Dataset (Zaharescu et al., 2012) with respect to 13 transformations. In these experiments, RoPS was compared to several state-of-the-art feature descriptors.



5.1 Performance on The Bologna Dataset

5.1.1 Dataset and Parameter Setting

The Bologna Dataset used in this paper comprises six models and 45 scenes. The six models (i.e., "Armadillo", "Asia Dragon", "Bunny", "Dragon", "Happy Buddha" and "Thai Statue") were taken from the Stanford 3D Scanning Repository. They are shown in Fig. [51 Each scene was synthetically generated by randomly rotating and translating three to five models in order to create clutter and pose variances. As a result, the ground truth rotations and translations between each model and its instances in the scenes were known a priori during the process of construction. An example scene is shown in Fig. 7.

Fig. 7: A scene on the Bologna Dataset.

The performance of each feature descriptor was assessed using the criterion of RP Curve (as detailed in Section 4.2). We compared our RoPS feature descriptor with five state-of-the-art feature descriptors, including spin image (Johnson and Hebert, 1999), normal histogram (NormHist) (Hetzel et al., 2001), LSP (Chen and Bhanu, 2007), THRIFT (Flint et al., 2007) and SHOT (Tombari et al., 2010). The support radius $r$ for all methods was set to 15mr as a compromise between the descriptiveness and the robustness to occlusion. The parameters for generating all these feature descriptors were tuned by optimizing the performance in terms of RP Curve on the Tuning Dataset. The tuned parameter settings for all feature descriptors are presented in Table 2.



Table 2: Tuned parameter settings for six feature descriptors.

Descriptor    Support Radius    Dimensionality    Length
Spin image    15mr              15*15             225
NormHist      15mr              15*15             225
LSP           15mr              15*15             225
THRIFT        15mr              32*1              32
SHOT          15mr              8*2*2*10          320
RoPS          15mr              3*3*3*5           135



In order to avoid the impact of the keypoint detection method on the descriptiveness of a feature, we randomly selected 1000 feature points from each model, and extracted their corresponding points from the scene. We then employed the methods listed in Table 2 to extract feature descriptors for these feature points. Finally, we calculated a RP Curve for each feature descriptor to evaluate the performance.
Fig. 8: Recall vs 1-Precision curves in the presence of noise. (e) Noise with a standard deviation of 0.4mr. (f) Noise with a standard deviation of 0.5mr. (Figure best seen in color.)



5.1.2 Robustness to Noise 



In order to evaluate the robustness of these feature descriptors to noise, we added Gaussian noise with increasing standard deviations of 0.1mr, 0.2mr, 0.3mr, 0.4mr and 0.5mr to the scene data. The RP Curves under different levels of noise are presented in Fig. 8.



We made a number of observations. i) These feature descriptors achieved comparable performance on noise-free data, with high recall together with high precision, as shown in Fig. 8(a).

ii) With noise, our proposed RoPS feature descriptor achieved the best performance in most cases, followed by SHOT. Specifically, the performance of RoPS was better than that of SHOT under low-level noise with a standard deviation of 0.1mr, as shown in Fig. 8(b). As the standard deviation of the noise increased to 0.2mr and 0.3mr, SHOT performed slightly better than RoPS, as indicated in Figures 8(c) and (d). However, the performance of our proposed RoPS was significantly better than that of SHOT under high levels of noise, e.g., with a noise deviation larger than 0.3mr, as shown in Figures 8(e) and (f). It can be inferred that RoPS is very robust to noise, particularly in the case of scenes with a high level of noise.

Fig. 9: Recall vs 1-Precision curves with respect to mesh resolution. (Figure best seen in color.)

iii) As the noise level increased, the performance of LSP and THRIFT deteriorated sharply, as shown in Figures 8(b-e). THRIFT failed to work even under low-level noise with a standard deviation of 0.1mr. This result is also consistent with the conclusion given in (Flint et al., 2008). Although NormHist and spin image worked relatively well under low- and medium-level noise with a standard deviation less than 0.2mr, they failed completely under noise with a large standard deviation. The sensitivity of spin image, NormHist, THRIFT and LSP to noise is due to the fact that they rely on surface normals to generate their feature descriptors. Since the calculation of a surface normal includes a process of differentiation, it is very susceptible to noise.

iv) The strong robustness of our RoPS feature descriptor to noise can be explained by at least three facts. First, RoPS encodes the "complete" information of the local surface from various viewpoints through rotation and therefore encodes more information than the existing methods. Second, RoPS only uses the low-order moments of the distribution matrices to form its feature descriptor and is therefore less affected by noise. Third, our proposed unique, unambiguous and stable LRF also helps to increase the descriptiveness and robustness of the RoPS feature descriptor.

5.1.3 Robustness to Varying Mesh Resolution 

In order to evaluate the robustness of these feature descriptors to varying mesh resolution, we resampled the noise-free scene meshes to 1/2, 1/4 and 1/8 of their original mesh resolution. The RP Curves under different levels of mesh decimation are presented in Figures 9(a-c).
It was found that our proposed RoPS feature descriptor outperformed all the other descriptors by a large margin under all levels of mesh decimation. It is also notable that the performance of our RoPS feature descriptor with 1/8 of the original mesh resolution was even comparable to the best results given by the existing feature descriptors with 1/2 of the original mesh resolution. Specifically, RoPS obtained a precision of more than 0.7 and a recall of more than 0.7 with 1/8 of the original mesh resolution, whereas spin image obtained a precision of around 0.8 and a recall of around 0.8 with 1/2 of the original mesh resolution, as shown in Figures 9(a) and (c). This indicates that our RoPS feature descriptor is very robust to varying mesh resolution.

The strong robustness of RoPS to varying mesh res- 
olution is due to at least two factors. First, the LRF of 
RoPS is derived by calculating the scatter matrix of 
all the points lying on the local surface rather than 
just the vertices, which makes RoPS robust to different 
mesh resolution. Second, the 2D projection planes are 
sparsely partitioned and only the low-order moments 
are used to form the feature descriptor, which further 
improves the robustness of our method to mesh resolu- 
tion. 



5.1.4 Robustness to Combined Noise and Mesh 
Decimation 

In order to further test the robustness of these feature descriptors to combined noise and mesh decimation, we resampled the scene meshes down to 1/2 of their original mesh resolution and added Gaussian random noise with a standard deviation of 0.1mr to the scenes. The resulting RP Curves are presented in Fig. 9(d).

As shown in Fig. 9(d), RoPS significantly outperformed the other methods in the scenes with both noise and mesh decimation, obtaining a high precision of about 0.9 and a high recall of about 0.9. It was followed by NormHist, SHOT, spin image and LSP, while THRIFT failed to work.

As summarized in Table 2, the RoPS feature descriptor length is 135, while the lengths of spin image, NormHist, LSP and SHOT are 225, 225, 225 and 320, respectively. RoPS is therefore more compact and more efficient for feature matching compared to these methods. Note that although the length of THRIFT is smaller than that of RoPS, THRIFT's performance in terms of recall and precision is surpassed by our RoPS feature descriptor by a large margin.



5.2 Performance on The PHOTOMESH Dataset 

The PHOTOMESH Dataset contains three null shapes. Two of the null shapes were obtained with multi-view stereo reconstruction algorithms, and the other one was generated with a modeling program. 13 transformations were applied to each shape. The transformations include color noise, color shot noise, geometry noise, geometry shot noise, rotation, scale, local scale, sampling, hole, micro-hole, topology changes and isometry. Each transformation has five different levels of strength.



To make a rigorous comparison with (Zaharescu et al., 2012), we set the support radius $r$ to $\sqrt{\alpha_r A_M / \pi}$, where $A_M$ is the total area of a mesh and $\alpha_r$ is 2%. RoPS feature descriptors were calculated at all points of the shapes, without any feature detection. We used the average normalized $L_2$ distance between the feature descriptors of corresponding points to measure the quality of a feature descriptor, as in (Zaharescu et al., 2012).



The experimental results of the RoPS descriptor are shown in Table 3. For comparison, the results of the MeshHOG descriptor (Gaussian curvature) without and with MeshDOG are also reported in Tables 4 and 5, respectively.

The RoPS descriptor was clearly invariant to color noise and color shot noise, because the geometric information used in RoPS cannot be affected by color deformations. RoPS was also invariant to rotation and scale, which means that it was invariant to rigid transformations.

The RoPS descriptor turned out to be very robust to geometry noise, geometry shot noise, local scale, holes, micro-holes, topology and isometry with noise. The average normalized $L_2$ distances for all these transformations were no more than 0.06, even under the highest level of transformations. The biggest challenge for the RoPS descriptor was sampling. The average normalized $L_2$ distance increased from 0.01 to 0.06 as the strength level changed from 1 to 5. However, RoPS was more robust to sampling than MeshHOG. As shown in Tables 3 and 4, the average normalized $L_2$ distance of RoPS with a strength level of 5 was even smaller than that of MeshHOG with a strength level of 1, i.e., 0.02 and 0.04, respectively. Overall, the average normalized $L_2$ distances of the RoPS descriptor were much smaller under all strength levels of all transformations compared to MeshHOG.



Table 3: Robustness of RoPS descriptor.

                       Strength
Transform.             1      ≤2     ≤3     ≤4     ≤5
Color Noise            0.00   0.00   0.00   0.00   0.00
Color Shot Noise       0.00   0.00   0.00   0.00   0.00
Geometry Noise         0.01   0.01   0.01   0.02   0.02
Geometry Shot Noise    0.01   0.01   0.02   0.03   0.05
Rotation               0.00   0.00   0.00   0.00   0.00
Scale                  0.00   0.00   0.00   0.00   0.00
Local Scale            0.01   0.01   0.02   0.02   0.02
Sampling               0.01   0.02   0.04   0.05   0.06
Holes                  0.01   0.01   0.01   0.01   0.02
Micro-Holes            0.00   0.01   0.01   0.01   0.01
Topology               0.01   0.01   0.02   0.02   0.03
Isometry + Noise       0.02   0.02   0.01   0.02   0.02
Average                0.00   0.01   0.01   0.02   0.02



Table 4: Robustness of MeshHOG (Gaussian curvature) without MeshDOG detector.

                       Strength
Transform.             1      ≤2     ≤3     ≤4     ≤5
Color Noise            0.00   0.00   0.00   0.00   0.00
Color Shot Noise       0.00   0.00   0.00   0.00   0.00
Geometry Noise         0.07   0.08   0.09   0.10   0.11
Geometry Shot Noise    0.02   0.03   0.05   0.06   0.09
Rotation               0.00   0.00   0.00   0.00   0.00
Scale                  0.00   0.00   0.00   0.00   0.00
Local Scale            0.06   0.07   0.08   0.09   0.10
Sampling               0.10   0.12   0.13   0.13   0.13
Holes                  0.01   0.02   0.04   0.03   0.05
Micro-Holes            0.01   0.01   0.03   0.04   0.04
Topology               0.07   0.10   0.11   0.11   0.12
Isometry + Noise       0.08   0.08   0.08   0.09   0.09
Average                0.04   0.04   0.05   0.06   0.06



Table 5: Robustness of MeshHOG (Gaussian curvature) with MeshDOG detector.

                       Strength
Transform.             1      ≤2     ≤3     ≤4     ≤5
Color Noise            0.00   0.00   0.00   0.00   0.00
Color Shot Noise       0.00   0.00   0.00   0.00   0.00
Geometry Noise         0.26   0.29   0.31   0.33   0.34
Geometry Shot Noise    0.04   0.09   0.14   0.21   0.29
Rotation               0.01   0.01   0.01   0.01   0.01
Scale                  0.01   0.01   0.01   0.01   0.00
Local Scale            0.21   0.25   0.28   0.30   0.31
Sampling               0.31   0.34   0.34   0.36   0.36
Holes                  0.02   0.02   0.07   0.07   0.07
Micro-Holes            0.01   0.01   0.07   0.07   0.08
Topology               0.13   0.20   0.22   0.25   0.28
Isometry + Noise       0.23   0.24   0.22   0.25   0.25
Average                0.10   0.12   0.14   0.15   0.17



6 3D Object Recognition Algorithm

So far we have developed a novel LRF and a RoPS feature descriptor. In this section, we propose a new hierarchical 3D object recognition algorithm based on the LRF and RoPS descriptor. Our 3D object recognition algorithm consists of four major modules, i.e., model representation, candidate model generation, transformation hypothesis generation, verification and segmentation. A flow chart illustration of the algorithm is given in Fig. 10.



6.1 Model Representation 

We first construct a model library for the 3D objects that we are interested in. Given a model $M$, $N_m$ seed points are evenly selected from the model pointcloud. Since the feature descriptors of closely located feature points may be similar (as they represent more or less the same local surface), a resolution control strategy (Zhong, 2009) is further enforced on these seed points to extract the final feature points. For each feature point $p_m$, the LRF $F_m$ and the feature descriptor (e.g., our RoPS descriptor) $f_m$ are calculated. The point position $p_m$, LRF $F_m$ and feature descriptor $f_m$ of all the feature points are then stored in a library for object recognition.

In order to speed up the process of feature matching during online recognition, the local feature descriptors from all models are indexed using a k-d tree method (Bentley, 1975). Note that the model feature calculation and indexing can be performed offline, while the following modules are operated online.



6.2 Candidate Model Generation 

The input scene $\mathcal{S}$ is first decimated, which results in
a low resolution mesh $\mathcal{S}'$. The vertices of $\mathcal{S}$ which are
nearest to the vertices of $\mathcal{S}'$ are selected as seed points
(following an approach similar to that of Mian et al. (2006b)).
Next, a resolution control strategy (Zhong, 2009) is enforced
on these seed points to prune out redundant seed points.
A boundary checking strategy (Mian et al., 2010) is also
applied to the seed points to eliminate the boundary points
of the range image. Further, since the LRF of a point can be
ambiguous when two eigenvalues of the overall scatter matrix
of the underlying local surface (the matrix used to define
the LRF) are equal, we impose a constraint on the ratio of
the eigenvalues, $\lambda_1/\lambda_2 > \tau_\lambda$, to exclude seed
points with symmetrical local surfaces, as in (Zhong, 2009;
Mian et al., 2010). The remaining seed points are
considered feature points. It is worth noting that the
feature point detection and LRF calculation procedures
can be performed simultaneously. Given the LRF $F_s$ of
a feature point $p_s$, its feature descriptor $f_s$ is
subsequently calculated.

Fig. 10: Flow chart of the 3D object recognition algorithm. The module of model representation is performed
offline, and the other modules are operated online.

The scene features are exactly matched against all
model features in the library using the previously
constructed k-d tree. If the ratio between the smallest
distance and the second smallest one is less than a
threshold $\tau_f$, the scene feature and its closest model
feature are considered a feature correspondence. Each
feature correspondence votes for a model. The models
which have received votes from feature correspondences
are considered candidate models. They are then ranked
according to the number of votes received. With these
ranked models, the subsequent steps (Sections 6.3 and
6.4) can be performed starting from the most likely
candidate model.
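
The matching and voting step described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: a brute-force nearest-neighbor search stands in for the k-d tree query, the function name `match_and_vote` and its arguments are hypothetical, and the default ratio threshold of 0.8 is a placeholder rather than a value taken from this paper.

```python
import math
from collections import Counter


def match_and_vote(scene_feats, model_feats, model_ids, tau_f=0.8):
    """Ratio-test matching followed by model voting.

    scene_feats : list of descriptor vectors extracted from the scene
    model_feats : list of descriptor vectors from all library models
    model_ids   : model_ids[j] names the model that owns model_feats[j]
    tau_f       : ratio threshold (0.8 is an illustrative placeholder)
    """
    correspondences = []
    votes = Counter()
    for i, f_s in enumerate(scene_feats):
        # Brute-force search; a k-d tree would return the same two neighbors.
        dists = sorted((math.dist(f_s, f_m), j) for j, f_m in enumerate(model_feats))
        if len(dists) < 2:
            continue
        (d1, j1), (d2, _) = dists[0], dists[1]
        # Keep the match only if it is clearly better than the runner-up.
        if d2 > 0 and d1 / d2 < tau_f:
            correspondences.append((i, j1, d1))
            votes[model_ids[j1]] += 1
    # Candidate models, ranked by the number of votes received.
    ranked = [model for model, _ in votes.most_common()]
    return correspondences, ranked
```

A scene feature whose two nearest model features are almost equally close is ambiguous and casts no vote, which is what keeps repetitive or featureless regions from polluting the candidate ranking.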



6.3 Transformation Hypothesis Generation

For a feature correspondence which votes for the model
$\mathcal{M}$, a rigid transformation is calculated by aligning the
LRF of the model feature to the LRF of the scene feature.
Specifically, given the LRF $F_s$ and the point position
$p_s$ of a scene feature, and the LRF $F_m$ and the point
position $p_m$ of a corresponding model feature, the rigid
transformation can be estimated by:

$$R = F_s^{T} F_m, \tag{19}$$

$$t = p_s - p_m R, \tag{20}$$



where R is the rotation matrix and t is the translation 
vector of the rigid transformation. It is worth noting 
that a transformation can be estimated from a single 
feature correspondence using our RoPS feature descrip- 
tor. This is a major advantage of our algorithm com- 
pared with most of the existing algorithms (e.g., splash, 
point signatures and spin image based methods) which 
require at least three correspondences to calculate a 
transformation (Johnson and Hebert, 1999). Our algorithm
not only eliminates the combinatorial explosion
of feature correspondences but also improves the relia- 
bility of the estimated transformation. 
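
As a minimal sketch of how a single correspondence yields a transformation, the code below aligns two LRFs. It assumes a convention that the paper does not spell out here: each LRF is stored as a 3×3 matrix whose rows are the unit axes, and points are treated as column vectors, so the translation is written as $p_s - R\,p_m$. The helper names are hypothetical.

```python
def transpose(A):
    """Transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*A)]


def matmul(A, B):
    """Matrix product A @ B for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matvec(A, v):
    """Matrix-vector product A @ v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def transform_from_lrfs(F_s, p_s, F_m, p_m):
    """Rigid transformation mapping model coordinates to scene coordinates.

    Assumes the rows of F_s and F_m are the LRF axes and that points are
    column vectors, so that x_scene = R @ x_model + t.
    """
    R = matmul(transpose(F_s), F_m)          # rotation, as in Eq. (19)
    Rp = matvec(R, p_m)
    t = [a - b for a, b in zip(p_s, Rp)]     # translation for column vectors
    return R, t


# Example: the scene is the model rotated by 90 degrees about z and
# shifted by (0, 0, 5); the model LRF is the identity at p_m = (1, 0, 0).
F_m = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
F_s = [[0, 1, 0], [-1, 0, 0], [0, 0, 1]]     # model axes after the rotation
R, t = transform_from_lrfs(F_s, [0, 1, 5], F_m, [1, 0, 0])
```

Because the two LRFs already fix all three rotational degrees of freedom, no second or third correspondence is needed to resolve the pose.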

Once all the plausible transformations $(R_i, t_i)$,
$i = 1, 2, \ldots, N_t$, between the scene $\mathcal{S}$ and the model
$\mathcal{M}$ have been calculated, they are grouped into
several clusters. Specifically, for each plausible
transformation, its rotation matrix $R_i$ is first converted into
three Euler angles which form a vector $u_i$. In this manner,
the difference between any two rotation matrices
can be measured by the Euclidean distance between
their corresponding Euler angles. The transformations
whose Euler angles are around $u_i$ (with distances less
than $\tau_a$) and whose translations are around $t_i$ (with
distances less than $\tau_t$) are grouped into a cluster $C_i$.
Therefore, each plausible transformation $(R_i, t_i)$ results
in a cluster $C_i$. The cluster center $(R_c, t_c)$ of $C_i$ is
calculated as the average rotation and translation in that
cluster. Next, a confidence score $s_c$ for each cluster is
calculated as:

$$s_c = \frac{n_f}{d}, \tag{21}$$

where $n_f$ is the number of feature correspondences in
the cluster, and $d$ is the average distance between the
scene features and their corresponding model features
which fall within the cluster. These clusters are sorted
according to their confidence scores, and the ones with
confidence scores smaller than half of the maximum score
are first pruned out. We then select the valid clusters
from the remaining clusters, starting from the highest
scored one and discarding the nearby clusters whose
distances to the already selected clusters are small (using
$\tau_a$ and $\tau_t$). $\tau_a$ and $\tau_t$ are empirically set to 0.2 and
30 mr throughout this paper. The selected clusters are then
allowed to proceed to the final verification and
segmentation stage (Section 6.4).
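
The grouping and pruning of transformation hypotheses can be sketched as below. Several choices here are assumptions made for illustration only: the ZYX Euler convention, the averaging of Euler angles to obtain a cluster center, and the reading of the confidence score in Eq. (21) as the number of correspondences divided by their average feature distance. The function names are hypothetical, and `tau_t` must be given in the same units as the translations (30 mr in the paper).

```python
import math


def euler_zyx(R):
    """Rotation matrix -> (yaw, pitch, roll); the ZYX convention is an assumption."""
    pitch = -math.asin(max(-1.0, min(1.0, R[2][0])))
    yaw = math.atan2(R[1][0], R[0][0])
    roll = math.atan2(R[2][1], R[2][2])
    return (yaw, pitch, roll)


def cluster_hypotheses(hyps, tau_a=0.2, tau_t=30.0):
    """Group hypotheses and keep the well supported, mutually distinct clusters.

    hyps : list of (R, t, d), where d is the feature distance of the
           correspondence that produced the hypothesis (R, t).
    """
    if not hyps:
        return []
    angles = [euler_zyx(R) for R, _, _ in hyps]
    clusters = []
    for i, (_, t_i, _) in enumerate(hyps):
        # Every hypothesis seeds one cluster made of its close neighbors.
        members = [j for j, (_, t_j, _) in enumerate(hyps)
                   if math.dist(angles[i], angles[j]) < tau_a
                   and math.dist(t_i, t_j) < tau_t]
        n_f = len(members)
        mean_d = sum(hyps[j][2] for j in members) / n_f
        clusters.append({
            "u": tuple(sum(angles[j][k] for j in members) / n_f for k in range(3)),
            "t": tuple(sum(hyps[j][1][k] for j in members) / n_f for k in range(3)),
            "n": n_f,
            "score": n_f / mean_d if mean_d > 0 else float("inf"),
        })
    clusters.sort(key=lambda c: c["score"], reverse=True)
    best = clusters[0]["score"]
    kept = []
    for c in clusters:
        if c["score"] < 0.5 * best:
            break  # sorted order: every remaining cluster is weaker still
        # Discard clusters that lie close to an already selected one.
        if all(math.dist(c["u"], k["u"]) >= tau_a or math.dist(c["t"], k["t"]) >= tau_t
               for k in kept):
            kept.append(c)
    return kept
```

Seeding one cluster per hypothesis avoids having to fix the number of clusters in advance; the duplicates this creates are removed by the final non-maximum selection.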
6.4 Verification and Segmentation

Given a scene $\mathcal{S}$, a candidate model $\mathcal{M}$ and a
transformation hypothesis $(R_c, t_c)$, the model $\mathcal{M}$ is first
transformed to the scene $\mathcal{S}$ by using the transformation
hypothesis $(R_c, t_c)$. This transformation is further refined
using the ICP algorithm (Besl and McKay, 1992),
resulting in a residual error $\epsilon$. After ICP refinement, the
visible proportion $\alpha$ is calculated as:

$$\alpha = \frac{n_c}{n_s}, \tag{22}$$

where $n_c$ is the number of corresponding points between
the scene $\mathcal{S}$ and the model $\mathcal{M}$, and $n_s$ is the total number
of points in the scene $\mathcal{S}$. Here, a scene point and a
transformed model point are considered corresponding
if their distance is less than twice the model resolution
(Mian et al., 2006b).

The candidate model $\mathcal{M}$ and the transformation
hypothesis $(R_c, t_c)$ are accepted as being correct only if
the residual error $\epsilon$ is smaller than a threshold $\tau_\epsilon$ and
the proportion $\alpha$ is larger than a threshold $\tau_\alpha$. However,
it is hard to determine the thresholds, because selecting
strict thresholds will reject correct hypotheses which
are highly occluded in the scene, while selecting loose
thresholds will produce many false positives. In this paper,
a flexible thresholding scheme is developed. To deal
with a highly occluded but well aligned object, we select
a small error threshold $\tau_{\epsilon 1}$ together with a small
proportion threshold $\tau_{\alpha 1}$. Meanwhile, in order to
increase the tolerance to the residual error which resulted
from an inaccurate estimation of the transformation,
we select a relatively larger error threshold $\tau_{\epsilon 2}$ together
with a larger proportion threshold $\tau_{\alpha 2}$. We chose these
thresholds empirically and set them as $\tau_{\epsilon 1} = 0.75$ mr,
$\tau_{\epsilon 2} = 1.5$ mr, $\tau_{\alpha 1} = 0.04$ and $\tau_{\alpha 2} = 0.2$ throughout the
paper.

Therefore, once $\epsilon < \tau_{\epsilon 1}$ and $\alpha > \tau_{\alpha 1}$, or $\epsilon < \tau_{\epsilon 2}$ and
$\alpha > \tau_{\alpha 2}$, the candidate model $\mathcal{M}$ and the transformation
hypothesis $(R_c, t_c)$ are accepted, and the scene points
which correspond to this model are removed from the
scene. Otherwise, this transformation hypothesis is
rejected and the next transformation hypothesis is verified
in turn. If no transformation hypothesis results in
an accurate alignment, we conclude that the model $\mathcal{M}$
is not present in the scene $\mathcal{S}$. If more than one
transformation hypothesis is accepted, it means that
multiple instances of the model $\mathcal{M}$ are present in the
scene $\mathcal{S}$.
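
The flexible thresholding rule can be written compactly as below. This is a sketch only: the function names are hypothetical, the default values are the thresholds quoted above, and the residual error is assumed to be expressed in units of mesh resolution (mr).

```python
def visible_proportion(n_c, n_s):
    """Eq. (22): share of scene points that correspond to the transformed model."""
    return n_c / n_s


def accept_hypothesis(eps, alpha, tau_e1=0.75, tau_a1=0.04, tau_e2=1.5, tau_a2=0.2):
    """Accept a hypothesis if it passes either threshold pair.

    eps   : ICP residual error, in units of mesh resolution (assumed)
    alpha : visible proportion from Eq. (22)
    """
    # Well aligned but possibly highly occluded object.
    tight = eps < tau_e1 and alpha > tau_a1
    # Less accurate alignment, compensated by a larger visible proportion.
    loose = eps < tau_e2 and alpha > tau_a2
    return tight or loose
```

For example, a hypothesis with a residual of 1.2 mr is rejected when only 5% of the scene is explained, but accepted when 25% is explained.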

Once all the transformation hypotheses for a candidate
model $\mathcal{M}$ are tested, the object recognition algorithm
then proceeds to the next candidate model. This
process continues until either all the candidate models
have been verified or there are too few points left in the
scene for recognition.



7 Performance of 3D Object Recognition 

The effectiveness of our proposed RoPS based 3D object
recognition algorithm was evaluated by a set of
experiments on four datasets, including the Bologna Dataset
(Tombari et al., 2010), the UWA Dataset (Mian et al.,
2006b), the Queen's Dataset (Taati and Greenspan, 2011)
and the Ca' Foscari Venezia Dataset (Rodolà et al., 2012).
These four datasets are amongst the most popular datasets
publicly available, containing multiple objects in each
scene in the presence of occlusion and clutter.



7.1 Recognition Results on The Bologna Dataset 

We used the Bologna Dataset to evaluate the effective- 
ness of our proposed RoPS based 3D object recognition 
algorithm. We specifically focused on the performance 
with respect to noise and varying mesh resolution. We 
also aimed to demonstrate the capability of our 3D ob- 
ject recognition algorithm to integrate the existing fea- 
ture descriptors without LRF. 

We used our RoPS together with the five feature
descriptors (as detailed in Section 5.1.1) to perform
object recognition. For feature descriptors that do not
have a dedicated LRF, e.g., spin image, NormHist, LSP
and THRIFT, the LRFs were defined using our proposed
technique. The average numbers of detected feature
points in an unsampled scene and a model were
985 and 1000, respectively.

In order to evaluate the performance of the 3D object
recognition algorithms on noisy data, we added
Gaussian noise with increasing standard deviations of
0.1 mr, 0.2 mr, 0.3 mr, 0.4 mr and 0.5 mr to each scene.
The average recognition rates of the six algorithms
on the 45 scenes are shown in Fig. 11(a). It can be seen
that both RoPS and SHOT based algorithms achieved
the best results, with recognition rates of 100% under
all levels of noise. Spin image and NormHist based
algorithms achieved recognition rates higher than 97%
under low-level noise with deviations less than 0.1 mr.
However, their performance deteriorated sharply as the
noise increased. LSP and THRIFT based algorithms, in
contrast, were very sensitive to noise.
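
A noise level expressed in mr can be turned into perturbed vertex positions as sketched below. The sketch rests on two assumptions that are not confirmed by this section: mesh resolution is taken as the mean edge length of the mesh, and the noise is added independently to the x, y and z coordinates of every vertex. Function names are hypothetical.

```python
import math
import random


def mesh_resolution(vertices, triangles):
    """Mean edge length of a triangular mesh (assumed definition of 'mr')."""
    edges = set()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))  # count each shared edge once
    return sum(math.dist(vertices[u], vertices[v]) for u, v in edges) / len(edges)


def add_gaussian_noise(vertices, triangles, level, seed=0):
    """Perturb each coordinate with zero-mean Gaussian noise of std level * mr."""
    rng = random.Random(seed)  # seeded so that experiments are repeatable
    sigma = level * mesh_resolution(vertices, triangles)
    return [tuple(x + rng.gauss(0.0, sigma) for x in v) for v in vertices]


# Example: one right triangle, noise level 0.1 mr.
verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
tris = [(0, 1, 2)]
noisy = add_gaussian_noise(verts, tris, level=0.1)
```

Tying the standard deviation to the mesh resolution makes the noise levels comparable across meshes of different physical size and sampling density.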

In order to evaluate the effectiveness of the 3D object
recognition algorithms with respect to varying mesh
resolution, the 45 noise free scenes were resampled to
1/2, 1/4 and 1/8 of their original mesh resolution. The
average recognition rates on the 45 scenes with respect
to different mesh resolutions are given in Fig. 11(b).

Fig. 11: Recognition rates on the Bologna Dataset. (a) Recognition rates in the presence of noise (horizontal
axis: noise deviation in mr). (b) Recognition rates with respect to varying mesh resolution (horizontal axis:
decimation). (Figure best seen in color.)

Fig. 12: The five models of the UWA Dataset. (a) Chef, (b) Chicken, (c) Parasaurolophus, (d) Rhino, (e) T-Rex.

Fig. 13: Two sample scenes and our recognition results on the UWA Dataset. (a) The first sample scene,
(b) our recognition result, (c) the second sample scene, (d) our recognition result. The correctly recognized
objects have been superimposed by their 3D complete models from the library. All objects were correctly
recognized except for the T-Rex in (d). (Figure best seen in color.)



It is shown that the RoPS based algorithm achieved the
best performance, obtaining a 100% recognition rate
under all levels of mesh decimation. It was followed by
the NormHist and spin image based algorithms, which
obtained recognition rates of 97.8% and 91.1%,
respectively, in scenes with 1/8 of the original mesh resolution.



7.2 Recognition Results on The UWA Dataset 



The UWA Dataset contains five 3D models and 50 real 
scenes. The scenes were generated by randomly placing 
four or five real objects together in a scene and scanned 
from a single viewpoint using a Minolta Vivid 910 scanner.
An illustration of the five models is given in Fig. 12
and two sample scenes are shown in Figs. 13(a)
and (c).

For the sake of consistency in comparison, RoPS
based 3D object recognition experiments were performed
on the same data as Mian et al. (2006b) and Bariya et al.
(2012). Besides, the Rhino model was excluded from the
recognition results, since it contained large holes and
could not be recognized by the spin image based algorithm
in any of the scenes. Comparison was performed with
a number of state-of-the-art algorithms, such as the tensor
(Mian et al., 2006b), spin image (Mian et al., 2006b),
keypoint (Mian et al., 2010), VD-LSD (Taati and Greenspan,
2011) and EM based (Bariya et al., 2012) algorithms.
Comparison results are shown in Fig. 14 with respect
to varying levels of occlusion. The average numbers of
detected feature points in a scene and a model were
2259 and 4247, respectively.



Occlusion is defined according to Johnson and Hebert
(1999) as:

$$\text{occlusion} = 1 - \frac{\text{model surface patch area in scene}}{\text{total model surface area}}. \tag{23}$$



The ground truth occlusion values were automatically
calculated for the correctly recognized objects and
manually calculated for the objects which were not
correctly recognized. As shown in Fig. 14, our RoPS based
algorithm outperformed all the existing algorithms. It
achieved a recognition rate of 100% with up to 80%
occlusion, and a recognition rate of 93.1% even under 85%
occlusion. The average recognition rate of our RoPS
based algorithm was 98.8%, while the average recognition
rates of the spin image, tensor and EM based algorithms
were 87.8%, 96.6% and 97.5% respectively, with up to
84% occlusion. The overall average recognition rate of
our RoPS based algorithm was 98.9%. Moreover, no
false positive occurred in the experiments when using
our RoPS based algorithm, and only two out of the total
188 objects in the 50 scenes were not correctly
recognized. These results confirm that our RoPS based
algorithm is able to recognize objects in complex scenes in
the presence of significant clutter, occlusion and mesh
resolution variation.
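
As a quick consistency check on the figures above, and assuming the overall recognition rate is simply the fraction of object instances that were correctly recognized, two failures out of 188 instances give:

$$\text{recognition rate} = \frac{188 - 2}{188} = \frac{186}{188} \approx 0.989,$$

which agrees with the reported overall rate of 98.9%.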

Two sample scenes and their corresponding recognition
results are shown in Fig. 13. All objects were correctly
recognized and their poses were accurately recovered
except for the T-Rex in Fig. 13(d). The reason for
the failure in Fig. 13(d) relates to the excessive occlusion
of the T-Rex. It is highly occluded and the visible
surface is sparsely distributed in several parts of the
body rather than in a single area. Therefore, almost no
reliable feature could be extracted from the object.

Fig. 14: Recognition rates on the UWA Dataset. (Figure
best seen in color.)

Note that, although we used a fixed support radius
(i.e., $r = 15$ mr) for feature description throughout
this paper, the proposed algorithm is generic, and
different adaptive-scale keypoint detection methods can
be seamlessly integrated with our RoPS descriptor.
In order to further demonstrate the generic nature of
our algorithm, we generated RoPS descriptors using the
support radii estimated by the adaptive-scale method in
(Mian et al., 2010). The recognition result is shown in
Fig. 14. The recognition performance of the adaptive-scale
RoPS based algorithm was better than that reported
in (Mian et al., 2010), which means that our
RoPS descriptor was more descriptive than the descriptor
used in (Mian et al., 2010). It is also observed that
the performance of adaptive-scale RoPS was marginally
worse than the fixed-scale counterpart. This is because
the errors of scale estimation adversely affected the
performance of feature matching, and ultimately object
recognition. That is, the corresponding points in a scene
and model may have different estimated scales due to
the estimation errors. As reported in (Tombari et al.,
2013), the scale repeatability of the adaptive-scale
detector in (Mian et al., 2010) was less than 85% and
60% on the Retrieval dataset and Random Views dataset,
respectively.



7.3 Recognition Results on The Queen's Dataset 

The Queen's Dataset contains five models and 80 real
scenes. The 80 scenes were generated by randomly placing
one, three, four or five of the models in a scene and
scanned from a single viewpoint using a LIDAR sensor.
The five models were generated by merging several
range images of a single object. Since all scenes and
models were represented in the form of pointclouds,
we first converted them into triangular meshes in order
to calculate the LRFs using our proposed technique.
A scene pointcloud was converted by mapping
the 3D pointcloud onto the 2D retina plane of the sensor
and performing a 2D Delaunay triangulation over
the mapped points. The 2D points and triangles were
then mapped back to the 3D space, resulting in a
triangular mesh. A model pointcloud was converted into
a triangular mesh using the Marching Cubes algorithm
(Guennebaud and Gross, 2007). An illustration of the
five models is given in Fig. 15 and two sample scenes
are shown in Figs. 16(a) and (c).

Fig. 15: The five models in the Queen's Dataset. (a) Angel, (b) Big Bird, (c) Gnome, (d) Kid, (e) Zoe.

Fig. 16: Two sample scenes and our recognition results on the Queen's Dataset. (a) The first sample scene,
(b) our recognition result, (c) the second sample scene, (d) our recognition result. The correctly recognized
objects have been superimposed by their 3D complete models from the library. All objects were correctly
recognized except for the Angel in (d). (Figure best seen in color.)

First, we performed object recognition using our
RoPS based algorithm on the full dataset which contains
80 real scenes. The average numbers of detected
feature points in a scene and a model were 3296 and
4993, respectively. The results are shown in parentheses
in Table 6 with a comparison to the results given
by Bariya et al. (2012). It can be seen that the average
recognition rate of our algorithm is 95.4%, whereas
the average recognition rate of the EM based algorithm
is 82.4%. These results indicate that our algorithm is
superior to the EM based algorithm although a
complicated keypoint detection and scale selection strategy
has been adopted by the EM based algorithm.

To make a direct comparison with the results given
by Taati and Greenspan (2011), we performed our RoPS
based 3D object recognition on the same subset dataset
which contains 55 scenes. The results are given in
Table 6 with comparisons to the results provided by two
variants of VD-LSD, 3DSC and four variants of spin
image. As shown in Table 6, our average recognition
rate was 95.4%, while the second best result achieved
by VD-LSD (SQ) was 83.8%. The RoPS based algorithm
achieved the best recognition rates for all the five
models. More than 97% of the instances of Angel, Big
Bird and Gnome were correctly recognized. Although
RoPS's recognition rate for Zoe was relatively low (i.e.,
87.2%), it still outperformed the existing algorithms by
a large margin, since the second best result achieved
by VD-LSD (SQ) was 71.8%. Fig. 16 shows two sample
scenes and our recognition results on the Queen's
Dataset. It can be seen that our RoPS based algorithm
was able to recognize objects with large amounts of
occlusion and clutter.

Note that the Queen's Dataset is more challenging
than the UWA Dataset since the former is noisier
and its points are not uniformly distributed. That is
the reason why the spin image based algorithm had a
significant drop in the recognition performance when
tested on the two datasets. Specifically, the average
recognition rate of the spin image based algorithm on the
UWA Dataset was 87.8% while the best result on the
Queen's Dataset was only 54.4%. Similarly, a notable
decrease of performance can also be found for the EM
based algorithm, with a 97.5% recognition rate for the
UWA Dataset and an 81.9% recognition rate for the Queen's
Dataset. However, our RoPS based algorithm was
consistently effective and robust to different kinds of
variations (including noise, varying mesh resolution and
occlusion). It outperformed the existing algorithms and
achieved comparable results in both datasets, obtaining
a recognition rate of 98.9% on the UWA Dataset
and 95.4% on the Queen's Dataset.

Table 6: Recognition rates (%) on the Queen's Dataset. The results of the tests on the full dataset containing
80 scenes are shown in parentheses. The others were tested on a subset dataset which contains 55 scenes. 'NA'
indicates that the corresponding item is not available. The best results are in bold fonts.

Method                         Angel         Big Bird        Gnome         Kid           Zoe           Average
RoPS                           97.4 (97.9)   100.0 (100.0)   97.4 (97.9)   94.9 (95.8)   87.2 (85.4)   95.4 (95.4)
EM                             NA (77.1)     NA (87.5)       NA (87.5)     NA (83.3)     NA (76.6)     81.9 (82.4)
VD-LSD(SQ)                     89.7          100.0           70.5          84.6          71.8          83.8
VD-LSD(VQ)                     56.4          97.4            69.2          51.3          64.1          67.7
3DSC                           53.8          84.6            61.5          53.8          56.4          62.1
Spin image (impr.)             53.8          84.6            38.5          51.3          41.0          53.8
Spin image (orig.)             15.4          64.1            25.6          43.6          28.2          35.4
Spin image spherical (impr.)   53.8          74.4            38.5          61.5          43.6          54.4
Spin image spherical (orig.)   12.8          61.5            30.8          43.6          30.8          35.9

We also performed a timing experiment to measure
the average processing time to recognize each object
in the scene. The experiment was conducted on a
computer with a 3.16 GHz Intel Core2 Duo CPU and 4 GB
of RAM. The code was implemented in MATLAB without
using any program optimization or parallel computing
technique. The average computational time to detect
feature points and calculate LRFs was 42.6 s. The average
computational time to generate RoPS descriptors
was 7.2 s. Feature matching consumed 46.6 s, while the
computational time for the transformation hypothesis
generation was negligible. Finally, verification and
segmentation cost 57.4 s on average.



7.4 Recognition Results on The Ca' Foscari Venezia 
Dataset 

This dataset is composed of 20 models and 150 scenes.
Each scene contains 3 to 5 objects in the presence of
occlusion and clutter. In total, there are 497 object
instances in all scenes. This dataset has been released
just recently. It is the largest available 3D object
recognition dataset. It is also more challenging than many
other datasets, containing several models with large flat
and featureless areas, and several models which are very
similar in shape (Rodolà et al., 2012).



The precision and recall values of the RoPS based
algorithm on this dataset are shown in Table 7. The results
reported in (Rodolà et al., 2012) are also given for
comparison. As in (Rodolà et al., 2012), two out of the
20 models were left out from the recognition tests and
used as clutter. The average numbers of detected feature
points in a scene and a model were 2210 and 5000,
respectively. The RoPS based algorithm achieved better
precision results compared to (Rodolà et al., 2012). The
average precision of the RoPS based algorithm was 99%,
which was higher than that of (Rodolà et al., 2012) by a
margin of 6%. Besides, the precision values of 14 individual
models were as high as 100%.
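
The averages quoted above can be re-derived from the per-model precision values listed in Table 7. The short check below is only an arithmetic sketch; it assumes the reported averages are unweighted means over the 18 tested models, and the list names are hypothetical.

```python
# Per-model precision (%) from Table 7, in the order the models are listed.
rops_precision = [97, 100, 100, 100, 100, 97, 100, 100, 100,
                  100, 100, 100, 100, 97, 96, 100, 100, 100]
game_precision = [100, 100, 78, 96, 93, 93, 95, 100, 91,
                  89, 95, 97, 88, 97, 91, 97, 83, 82]


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


rops_avg = mean(rops_precision)    # about 99.3
game_avg = mean(game_precision)    # 92.5
perfect = sum(1 for p in rops_precision if p == 100)  # models at 100% precision
```

The means come out at roughly 99.3% and 92.5%, consistent with the reported 99% average and the margin of about 6 points, and 14 of the 18 models reach 100% precision.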

The average recall of the RoPS based algorithm was
96%, whereas the average recall of (Rodolà et al.,
2012) was 95%. Moreover, the RoPS based algorithm achieved
equal or better recall values on 17 individual models
out of the 18 models. Note that SHOT descriptors and
a game-theoretic framework are used in (Rodolà et al.,
2012) for 3D object recognition. It is observed that our
RoPS based algorithm performed better than the SHOT
based algorithm on this dataset.

In summary, the superior performance of our RoPS
based 3D object recognition algorithm is due to several
reasons. First, the high descriptiveness and strong
robustness of our RoPS feature descriptor improve the
accuracy of feature matching and therefore boost the
performance of 3D object recognition. Second, the unique,
repeatable and robust LRF enables the estimation of
a rigid transformation from a single feature
correspondence, which reduces the errors of transformation
hypotheses. This is because the probability of selecting
only one correct feature correspondence is much
higher than the probability of selecting three correct
feature correspondences. Moreover, our proposed
hierarchical object recognition algorithm enables object
recognition to be performed in an effective and efficient
manner.



8 Conclusion 

In this paper, we proposed a novel RoPS feature
descriptor for 3D local surface description, and a new
hierarchical RoPS based algorithm for 3D object
recognition. The RoPS feature descriptor is generated by
rotationally projecting the neighboring points around a
feature point onto three coordinate planes and
calculating the statistics of the distribution of the projected
points. We also proposed a novel LRF by calculating
the scatter matrix of all points lying on the local
surface rather than just mesh vertices. The unique and
highly repeatable LRF facilitates the effectiveness and
robustness of the RoPS descriptor.

Table 7: Precision and recall values on the Ca' Foscari Venezia Dataset. The best results are in bold fonts.

                             Armadillo  Bunny  Cat1  Centaur1  Chef  Chicken  Dog7  Dragon  Face
Precision  RoPS                  97      100   100     100     100     97     100    100    100
           Game-theoretic       100      100    78      96      93     93      95    100     91
Recall     RoPS                 100      100    44     100     100    100      91    100    100
           Game-theoretic        97       97    82     100     100    100      86     89     95

                             Ganesha  Gorilla0  Horse7  Lioness13  Para  Rhino  T-Rex  Victoria3  Wolf2
Precision  RoPS                100      100      100      100       97     96    100      100      100
           Game-theoretic       89       95       97       88       97     91     97       83       82
Recall     RoPS                100      100      100      100       97    100    100       95      100
           Game-theoretic      100       91      100      100       94     91     97       83       95

We performed a set of experiments to assess our 
RoPS feature descriptor with respect to a set of differ- 
ent nuisances including noise, varying mesh resolution 
and holes. Comparative experimental results show that 
our RoPS descriptor outperforms the state-of-the-art 
methods, obtaining high descriptiveness and strong ro- 
bustness to noise, varying mesh resolution and other 
deformations. 

Moreover, we performed extensive experiments for 
3D object recognition in complex scenes in the presence 
of noise, varying mesh resolution, clutter and occlusion. 
Experimental results on the Bologna Dataset show that 
our RoPS based algorithm is very effective and robust 
to noise and mesh resolution variation. Experimental 
results on the UWA Dataset show that RoPS based 
algorithm is very robust to occlusion and outperforms 
existing algorithms. The recognition results achieved on 
the Queen's Dataset show that our algorithm outper- 
forms the state-of-the-art algorithms by a large mar- 
gin. The RoPS based algorithm was further tested on 
the largest available 3D object recognition dataset (i.e., 
the Ca' Foscari Venezia Dataset), reporting superior 
results. Overall, our algorithm has achieved significant 
improvements over the existing 3D object recognition 
algorithms when tested on the same dataset. 

Interesting future research directions include the ex- 
tension of the proposed RoPS feature to encode both 
geometric and photometric information. Integrating ge- 
ometric and photometric cues would be beneficial for 
the recognition of 3D objects with poor geometric but 
rich photometric features (e.g., a flat or spherical
surface). Another direction is to adopt our RoPS descriptors
to perform 3D shape retrieval on a large scale 3D
shape corpus, e.g., the SHREC Datasets (Bronstein et al.,
2010b).